# Parsing output of AF3 from the cluster
created by Andreas 2025-02-04

Script to parse the output of AF3 runs on the IMB server cluster. It also detects failed runs using the reports created by Nextflow.

This notebook is meant to be run cell by cell from top to bottom. The runtime was around 35 minutes, but most of it (an estimated 90%) comes from accessing the files on a network drive.

### 0 Settings + Imports

Run the following cells and edit the necessary paths and settings in the second cell.

Note: In this run the known_extensions have been excluded. To include them, change the code in the _# Loading the output folders_ cell in this section and the first cell of section 3

In [1]:
# Imports
from pathlib import Path
import pandas as pd
import numpy as np
import re
import json

import pymol
from Bio.PDB import PDBParser
from Bio.PDB.Structure import Structure as BioPy_PDBStructure
from Bio.PDB.Model import Model as BioPy_PDBModel
from Bio.PDB.PDBExceptions import PDBConstructionException
parser = PDBParser(QUIET=True)

ressources_path = Path("../ressources").resolve()


In [2]:
# Settings

# The base folder of the AF output. The AF3 files are searched inside /Alpha
luck_drive_folder = Path("L:/imb-luckgr2/projects/AlphaFold") 

# The folder to export the output
export_destination = Path(r"D:\Eigene Datein\dev\Uni\JGU Bio Bachelorthesis\Daten\resources\AF3")

# Set this option to skip existing structures
export_skip_existing_structures = False

In [None]:
# Loading the output folders
# Note: The known_extension structures should not be considered, so exclude them explicitly
af3_runs_folder = luck_drive_folder / "AlphaFold3"
DMI_folders = [p for p in (af3_runs_folder / "AlphaFold_benchmark_DMI").iterdir() if p.is_dir() and "known_extension" not in p.name]
DDI_folders = [p for p in (af3_runs_folder / "AlphaFold_benchmark_DDI").iterdir() if p.is_dir()]
benchmark_folders = DMI_folders + DDI_folders

# Load also the AF2 folder to get the original input files without switched chains to undo the chain switching
af2_folders = { "known_minimal": luck_drive_folder / "AlphaFold_benchmark_DMI" / "run37", # DMI known minimals
                "random_minimal": luck_drive_folder / "AlphaFold_benchmark_DMI" / "run38", # DMI random minimals
                "mutations": luck_drive_folder / "AlphaFold_benchmark_DMI" / "run43", # DMI mutations
                "known_ddi": luck_drive_folder / "AlphaFold_benchmark_DDI" / "run5", # DDI known DDI
                "random_ddi": luck_drive_folder / "AlphaFold_benchmark_DDI" / "run6", # DDI random DDI
}

print("Folders with AF3 output:")
for p in benchmark_folders:
    if not p.exists():
        print(f"{p} does not exist")
    else:
        print(p)
        
print("Folders with AF2 output")
for p in af2_folders.values():
    if not p.exists():
        print(f"{p} does not exist")
    else:
        print(p)

Folders with AF3 output:
L:\imb-luckgr2\projects\AlphaFold\AlphaFold3\AlphaFold_benchmark_DMI\known_minimal
L:\imb-luckgr2\projects\AlphaFold\AlphaFold3\AlphaFold_benchmark_DMI\random_minimal
L:\imb-luckgr2\projects\AlphaFold\AlphaFold3\AlphaFold_benchmark_DMI\mutations
L:\imb-luckgr2\projects\AlphaFold\AlphaFold3\AlphaFold_benchmark_DDI\known_ddi
L:\imb-luckgr2\projects\AlphaFold\AlphaFold3\AlphaFold_benchmark_DDI\random_ddi
Folders with AF2 output
L:\imb-luckgr2\projects\AlphaFold\AlphaFold_benchmark_DMI\run37
L:\imb-luckgr2\projects\AlphaFold\AlphaFold_benchmark_DMI\run38
L:\imb-luckgr2\projects\AlphaFold\AlphaFold_benchmark_DMI\run43
L:\imb-luckgr2\projects\AlphaFold\AlphaFold_benchmark_DDI\run5
L:\imb-luckgr2\projects\AlphaFold\AlphaFold_benchmark_DDI\run6


### 1 Scanning input .json files and report.html files
Scans for all input .json files and corresponding report_%time%.html files to find failed runs
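As a minimal sketch of how one report is evaluated, the snippet below applies the same pattern and success marker as the scanning cell of this section to a made-up report fragment. The `sample_report` text, the `RUN_AF3` label inside it and the `parse_report` helper are hypothetical and only illustrate the idea; real Nextflow reports contain far more markup.

```python
import re

# Pattern and success marker as used in the scanning cell of this notebook
NAME_PATTERN = re.compile(r"\(\[id:\[([\w\-\.]+)\], jobsize:\d+\]\)")
SUCCESS_MARKER = "Workflow execution completed successfully!"


def parse_report(content: str):
    """Return (prediction_name or None, run_ok) for the text of one report."""
    match = NAME_PATTERN.search(content)
    prediction_name = match.group(1) if match else None
    return prediction_name, SUCCESS_MARKER in content


# Hypothetical report fragment, not copied from a real report file
sample_report = (
    "<p>Workflow execution completed successfully!</p>"
    "<td>RUN_AF3 ([id:[LIG_HOMEOBOX_1B72], jobsize:1])</td>"
)
name, run_ok = parse_report(sample_report)
print(name, run_ok)
```

If the pattern does not match, the name is None, which corresponds to the None case for prediction_name in the column list of this section.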

* **benchmark_set** refers to the pairing method (mutated, randomized, ...) and is equal to the folder name (example: known_minimal)
* **prediction_name** is extracted from a) the json file or b) the report_%time%.html file. If the value can't be extracted from report_%time%.html, it is set to None
* **report_file** refers to the name of the report_%time%.html file and is None if the input json file could not be matched with a report file (--> the input json has not been run on the cluster \[yet\])
* **run_ok** indicates whether the input file ran on the server without an error. It is set to None if the input file has not been run on the cluster.
* **input_json** is set to the filename of the input json or to None if the report_%time%.html could not be matched with an input json file.

Use <i>report_file == None</i> or <i>run_ok == None</i> to identify scheduled but not yet run structures.<br>Use <i>input_json == None</i> to find runs without an input file
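A minimal sketch of these filters on a small made-up table with the same columns as report_df. In pandas, a literal comparison with `== None` does not select missing values, so the sketch uses `isna()` instead. All row values are hypothetical.

```python
import pandas as pd

# Toy stand-in for report_df; the rows are invented for illustration
toy_report_df = pd.DataFrame(
    {
        "benchmark_set": ["known_minimal", "known_minimal", "known_ddi"],
        "prediction_name": ["EXAMPLE_A", "EXAMPLE_B", "EXAMPLE_C"],
        "report_file": ["report_a.html", None, "report_c.html"],
        "run_ok": [True, None, False],
        "input_json": ["EXAMPLE_A.json", "EXAMPLE_B.json", None],
    }
)

# Scheduled but not yet run: an input json exists, but no report was matched
not_yet_run = toy_report_df[toy_report_df["report_file"].isna()]

# Runs without an input file: a report exists, but no input json was matched
without_input = toy_report_df[toy_report_df["input_json"].isna()]

# Failed runs: compare explicitly, because run_ok may also be missing
failed = toy_report_df[toy_report_df["run_ok"] == False]

print(not_yet_run["prediction_name"].tolist())
print(without_input["prediction_name"].tolist())
print(failed["prediction_name"].tolist())
```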

In [5]:
# Scanning input and report files
report_df = pd.DataFrame(columns=["benchmark_set", "prediction_name", "report_file", "run_ok", "input_json"])
for folder in benchmark_folders:
    benchmark_set = folder.name
    print(benchmark_set)
    # The f.is_file() check is deliberately omitted in the following lines, as it slows the scan down massively on a network drive (example runtime: 0.9 s --> 1 min 25.6 s)
    nextflow_inputs = [f for f in folder.iterdir() if f.suffix.lower() == ".json"]
    for nextflow_input in nextflow_inputs:
        prediction_name = nextflow_input.stem
        report_df.loc[len(report_df)] = {"benchmark_set": benchmark_set, "prediction_name": prediction_name, "report_file": None, "run_ok": None, "input_json": nextflow_input.name}

    for p in [f for f in folder.iterdir() if  "report_" in f.stem and f.suffix.lower() == ".html"]:
        print("\t", p.name, end=" ")
        with open(p) as f:
            content = f.read()
        prediction_name = x.groups()[0] if (x := re.search(r"\(\[id:\[([\w\-\.]+)\], jobsize:\d+\]\)", content)) is not None else None
        print("->", prediction_name)
        finished = bool("Workflow execution completed successfully!" in content)
        num_report_df = len(report_df.loc[np.logical_and(report_df["benchmark_set"] == benchmark_set, report_df["prediction_name"] == prediction_name), ["prediction_name"]])
        if num_report_df == 0:
            print("\t\tNo input json file")
            report_df.loc[len(report_df)] = {"benchmark_set": benchmark_set, "report_file": p.name, "prediction_name": prediction_name, "run_ok": finished, "input_json": None}
        elif num_report_df > 1:
            print(f"\t\tMultiple reports for same prediction")
            report_df.loc[len(report_df)] = {"benchmark_set": benchmark_set, "report_file": p.name, "prediction_name": prediction_name, "run_ok": finished, "input_json": None}
        else:
            report_df.loc[np.logical_and(report_df["benchmark_set"] == benchmark_set, report_df["prediction_name"] == prediction_name), ["run_ok"]] = finished
            report_df.loc[np.logical_and(report_df["benchmark_set"] == benchmark_set, report_df["prediction_name"] == prediction_name), ["report_file"]] = p.name

known_minimal
	 report_2025-02-05_13-18.html -> LIG_HOMEOBOX_1B72
	 report_2025-02-05_13-35.html -> DOC_SPAK_OSR1_1_2V3S
	 report_2025-02-05_13-51.html -> DOC_USP7_MATH_1_3MQS
	 report_2025-02-05_14-08.html -> DOC_USP7_MATH_2_1YY6
	 report_2025-02-05_14-25.html -> DOC_USP7_UBL2_3_4YOC
	 report_2025-02-05_16-00.html -> DOC_USP7_UBL2_3_4YOC
	 report_2025-02-05_16-18.html -> LIG_14-3-3_ChREBP_3_5F74
	 report_2025-02-05_16-36.html -> LIG_ActinCP_TwfCPI_2_7DS2
	 report_2025-02-05_16-54.html -> LIG_Actin_RPEL_3_2V51
	 report_2025-02-05_17-18.html -> LIG_ActinCP_CPI_1_3AA0
	 report_2025-02-05_17-36.html -> LIG_Pex14_3_4BXU
	 report_2025-02-05_17-51.html -> LIG_PROFILIN_1_2V8C
	 report_2025-02-05_18-08.html -> LIG_PTAP_UEV_1_1M4P
	 report_2025-02-05_18-24.html -> LIG_PTB_Apo_2_1NTV
	 report_2025-02-05_18-41.html -> LIG_Rb_LxCxE_1_1GH6
	 report_2025-02-05_18-58.html -> LIG_Rb_pABgroove_1_1N4M
	 report_2025-02-05_19-15.html -> LIG_REV1ctd_RIR_1_2LSI
	 report_2025-02-05_19-32.html -> LIG_RPA_C_Ve

In [6]:
# Displaying output
num_input = len(report_df[~report_df['input_json'].isna()])
num_output_with_input = len(report_df[np.logical_and(~report_df["input_json"].isna(), ~report_df["report_file"].isna())])
num_output_total = len(report_df[~report_df['report_file'].isna()])
num_ok_with_input = len(report_df[np.logical_and(~report_df["input_json"].isna(), report_df["run_ok"] == True)])
num_fail_with_input = len(report_df[np.logical_and(~report_df["input_json"].isna(), report_df["run_ok"] == False)])
num_ok = len(report_df[report_df["run_ok"] == True])
num_fail = len(report_df[report_df["run_ok"] == False])

print(f"{num_output_with_input}/{num_input} of the scheduled structures have finished. {num_ok_with_input} were successful and {num_fail_with_input} failed")
if num_output_total != num_output_with_input:
    print(f"There are {num_output_total - num_output_with_input} reported runs which could not be identified. {num_ok - num_ok_with_input} of them were successful and {num_fail - num_fail_with_input} failed")
print(f"Benchmark sets: {set(report_df['benchmark_set'])}")
report_df

636/636 of the scheduled structures have finished. 636 were successful and 0 failed
Benchmark sets: {'known_ddi', 'random_minimal', 'known_minimal', 'mutations', 'random_ddi'}


Unnamed: 0,benchmark_set,prediction_name,report_file,run_ok,input_json
0,known_minimal,LIG_HOMEOBOX_1B72,report_2025-02-05_13-18.html,True,LIG_HOMEOBOX_1B72.json
1,known_minimal,DOC_SPAK_OSR1_1_2V3S,report_2025-02-05_13-35.html,True,DOC_SPAK_OSR1_1_2V3S.json
2,known_minimal,DOC_USP7_MATH_1_3MQS,report_2025-02-05_13-51.html,True,DOC_USP7_MATH_1_3MQS.json
3,known_minimal,DOC_USP7_MATH_2_1YY6,report_2025-02-05_14-08.html,True,DOC_USP7_MATH_2_1YY6.json
4,known_minimal,DOC_USP7_UBL2_3_4YOC,report_2025-02-05_16-00.html,True,DOC_USP7_UBL2_3_4YOC.json
...,...,...,...,...,...
631,random_ddi,D1PF14447_PF00179_3ZNI.D2PF14978_PF00327_5OOL,report_2025-02-11_07-40.html,True,D1PF14447_PF00179_3ZNI.D2PF14978_PF00327_5OOL....
632,random_ddi,D1PF14978_PF00327_5OOL.D2PF15985_PF10175_6D6Q,report_2025-02-11_07-59.html,True,D1PF14978_PF00327_5OOL.D2PF15985_PF10175_6D6Q....
633,random_ddi,D1PF15985_PF10175_6D6Q.D2PF17838_PF00071_3KZ1,report_2025-02-11_08-16.html,True,D1PF15985_PF10175_6D6Q.D2PF17838_PF00071_3KZ1....
634,random_ddi,D1PF17838_PF00071_3KZ1.D2PF18773_PF00071_2X19,report_2025-02-11_08-34.html,True,D1PF17838_PF00071_3KZ1.D2PF18773_PF00071_2X19....


### 2 Parsing the AF output
Iterates over the Nextflow output folders, reads the AF data and creates a TSV file containing all the metrics. Along the way, it checks for missing, corrupted or unexpected data using the report_df from section 1.
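The ranking step of the parsing cell can be sketched on a toy metrics table: the models of one prediction are sorted by `ranking_score` in descending order and relabelled `ranked_0`, `ranked_1`, ... by position. The seed names and scores are invented, and the sketch derives the labels from the index instead of using `apply`, which is an equivalent alternative and not the notebook's exact code.

```python
import pandas as pd

# Toy stand-in for one alphafold3_metrics.tsv; values are invented
toy_metrics = pd.DataFrame(
    {
        "prediction_name": ["example"] * 3,
        "model_id": ["seed-1_sample-0", "seed-1_sample-1", "seed-1_sample-2"],
        "ranking_score": [0.25, 0.91, 0.40],
    }
)

# Best model first; ignore_index gives a fresh 0..n-1 index after sorting
ranked = toy_metrics.sort_values(
    by="ranking_score", ascending=False, ignore_index=True
)

# Replace the seed-based model_id with its rank position
ranked["model_id"] = [f"ranked_{i}" for i in ranked.index]

print(ranked[["model_id", "ranking_score"]])
```

Because the original seed name is overwritten, the link to the model file has to be stored (here: in `model_path`) before this step, which is the order the parsing cell follows.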

In [None]:
# Parsing AF output of Nextflow
dataAF = pd.DataFrame() # Holding the output metrics and metadata of the runs
missformed_outputs = pd.DataFrame(columns=["benchmark_set", "prediction_name", "model_seed", "reason"])
empty_outputs = pd.DataFrame(columns=["benchmark_set", "nextflow_name"])

for folder in benchmark_folders:
    benchmark_set = folder.name
    print(benchmark_set)
    nextflowFolders = [p for p in folder.iterdir() if p.is_dir()]
    for nextflowFolder in nextflowFolders:
        print("\t", f"{nextflowFolder.name:<30}", end=" -> ")
        if not (metricPath := (nextflowFolder / "alphafold3_metrics.tsv")).exists():
            empty_outputs.loc[len(empty_outputs)] = {"benchmark_set":benchmark_set, "nextflow_name": nextflowFolder.name}
            print("")
            continue
        metric_file = pd.read_csv(metricPath, delimiter="\t", header=0)
        metric_file["benchmark_set"] = benchmark_set
        metric_file["model_path"] = None
        if not metric_file.shape[0] >= 1: # Testing if there is at least one run included in the metric file
            empty_outputs.loc[len(empty_outputs)] = {"benchmark_set":benchmark_set, "nextflow_name": nextflowFolder.name}
            print("")
            continue

        prediction_name = metric_file["prediction_name"][0]
        print(prediction_name)
        if not len(set(metric_file["prediction_name"])) == 1: # Testing if only one prediction name is mentioned in the metric file
            missformed_outputs.loc[len(missformed_outputs)] = {"benchmark_set": benchmark_set, "prediction_name": prediction_name, "reason": "multiple prediction_name for one structure"}
            continue
        
        if not (structureFolder := nextflowFolder / "predictions" / "alphafold3" / prediction_name).exists():
            missformed_outputs.loc[len(missformed_outputs)] = {"benchmark_set": benchmark_set, "prediction_name": prediction_name, "reason": "prediction folder does not exist"}
            continue
        for model_file in [(p / "model.cif") for p in structureFolder.iterdir() if p.is_dir() and (p / "model.cif").exists()]:
            model_seed = model_file.parent.name
            if len(metric_file.loc[metric_file["model_id"] == model_seed, ["model_path"]]) == 0: # Testing if the run with the specific seed is mentioned in the metric file
                missformed_outputs.loc[len(missformed_outputs)] = {"benchmark_set": benchmark_set, "prediction_name": prediction_name, "model_seed": model_seed, "reason": "model seed is not contained in tsv file"}
                continue
            metric_file.loc[metric_file["model_id"] == model_seed, ["model_path"]] = model_file.relative_to(af3_runs_folder)
        
        metric_file.sort_values(by=['ranking_score'], ascending=False, ignore_index=True, inplace=True)
        metric_file["model_id"] = metric_file.apply(lambda r: f"ranked_{int(r.name)}", axis=1)
        dataAF = pd.concat([dataAF, metric_file], ignore_index=True)

dataAF.drop(columns=["project_name"], inplace=True)
# Reordering of the columns
c = list(dataAF.columns)
c.remove("prediction_name")
c.remove("model_preset")
c.remove("benchmark_set")
c.remove("ranking_score")
c.insert(0, "model_preset")
c.insert(1, "benchmark_set")
c.insert(2, "prediction_name")
c.insert(4, "ranking_score")

dataAF = dataAF[c]

known_minimal
	 happy_brenner -> lig_homeobox_1b72
	 nice_caravaggio -> doc_spak_osr1_1_2v3s
	 zen_tuckerman -> doc_usp7_math_1_3mqs
	 adoring_mercator -> doc_usp7_math_2_1yy6
	 disturbed_lichterman -> 
	 peaceful_allen -> doc_usp7_ubl2_3_4yoc
	 crazy_goodall -> lig_14-3-3_chrebp_3_5f74
	 intergalactic_lavoisier -> lig_actincp_twfcpi_2_7ds2
	 voluminous_gautier -> lig_actin_rpel_3_2v51
	 dreamy_golick -> lig_actincp_cpi_1_3aa0
	 lonely_ride -> lig_pex14_3_4bxu
	 reverent_lichterman -> lig_profilin_1_2v8c
	 sad_borg -> lig_ptap_uev_1_1m4p
	 pensive_spence -> lig_ptb_apo_2_1ntv
	 exotic_spence -> lig_rb_lxcxe_1_1gh6
	 suspicious_mclean -> lig_rb_pabgroove_1_1n4m
	 sick_snyder -> lig_rev1ctd_rir_1_2lsi
	 irreverent_bell -> lig_rpa_c_vert_1dpu
	 cheesy_wing -> lig_sh3_2_1cka
	 sharp_shaw -> deg_apcc_kenbox_2_4ggd
	 magical_heisenberg -> deg_cop1_1_5igo
	 hungry_yonath -> deg_kelch_keap1_1_2flu
	 intergalactic_shaw -> deg_kelch_keap1_2_3wn7
	 exotic_mestorf -> deg_mdm2_swib_1_1ycr
	 trustin

In [9]:
# Find missing structures and correct lower case names

report_df_ = report_df[~report_df["prediction_name"].isna()].copy() # Create copy to allow merging by lowercase prediction_name
report_df_["prediction_name_lower"] = report_df["prediction_name"].str.lower()

# Correcting lower case names
dataAF = pd.merge(
    left = dataAF,
    right = report_df_,
    how="outer", # Using outer to check for missing runs using report_df and filter in a later step
    left_on = ["benchmark_set", "prediction_name"],
    right_on = ["benchmark_set", "prediction_name_lower"],
    suffixes = ["", "_input"]
)

# Detecting missing outputs (= structures with input json but without a output nextflow folder)
missing_outputs = dataAF[np.logical_and(~dataAF["input_json"].isna(), dataAF["prediction_name"].isna())]
missing_outputs = missing_outputs[["benchmark_set", "prediction_name_input", "report_file", "run_ok", "input_json"]]
# Detect unidentified outputs (= output folders which do not have an input .json)
unidentified_outputs = dataAF[dataAF["prediction_name_input"].isna()]
unidentified_outputs = unidentified_outputs[["benchmark_set", "prediction_name", "model_id"]]

# Filter to only include AF outputs and not input files (copy first, so the assignment below does not act on a view)
dataAF = dataAF[~dataAF["prediction_name"].isna()].copy()
# Replacing AF prediction_name with the proper upper case variant
dataAF["prediction_name"] = dataAF["prediction_name_input"]
# Drop the unnecessary columns added
dataAF.drop(columns=["prediction_name_input", "prediction_name_lower", "report_file", "run_ok", "input_json"], inplace=True)

In [10]:
# Display the dataAF output and information about potential errors
print(f"Currently {len(set(dataAF['prediction_name']))} valid AF output folders have been generated")
display(dataAF)
print("Processed files with errors or missing output")
display(missing_outputs)
print("Missformed outputs")
display(missformed_outputs)
print("Empty output folders")
display(empty_outputs)

Currently 636 valid AF output folders have been generated


Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,iptm,...,chainA_intf_avg_plddt,chainB_intf_avg_plddt,intf_avg_plddt,num_chainA_intf_res,num_chainB_intf_res,num_res_res_contact,num_atom_atom_contact,iPAE,pDockQ,model_path
0,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_0,0.28,113,189,0.04,0.0,0.20,...,58.43,62.30,60.44,12,13,26,184,18.66,0.04,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
1,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_1,0.25,113,189,0.04,0.0,0.16,...,60.24,57.70,59.02,13,12,26,204,21.40,0.05,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
2,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_2,0.22,113,189,0.04,0.0,0.13,...,57.70,57.81,57.76,13,14,28,202,23.16,0.05,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
3,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_3,0.19,113,189,0.04,0.0,0.10,...,46.30,57.75,52.58,14,17,38,286,25.10,0.04,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
4,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_4,0.17,113,189,0.04,0.0,0.07,...,32.83,47.38,40.71,11,13,21,133,27.70,0.02,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3175,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_0,0.93,4,312,0.02,0.0,0.91,...,84.04,96.90,94.45,4,17,26,239,3.50,0.31,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3176,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_1,0.92,4,312,0.01,0.0,0.90,...,83.70,97.70,95.04,4,17,26,233,3.60,0.31,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3177,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_2,0.92,4,312,0.02,0.0,0.90,...,83.24,96.88,94.40,4,18,28,246,3.65,0.30,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3178,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_3,0.91,4,312,0.01,0.0,0.90,...,83.07,97.64,94.86,4,17,26,232,3.60,0.30,AlphaFold_benchmark_DMI\random_minimal\gloomy_...


Processed files with errors or missing output


Unnamed: 0,benchmark_set,prediction_name_input,report_file,run_ok,input_json


Missformed outputs


Unnamed: 0,benchmark_set,prediction_name,model_seed,reason


Empty output folders


Unnamed: 0,benchmark_set,nextflow_name
0,known_minimal,disturbed_lichterman


### 3 Output adjustments
The following sections correct some errors and problems with the data

### 3a Undo chain naming by length
For the AF3 run, the chains were assigned IDs by length (chain A is the shorter chain, chain B the longer one). This is fine for DMI, but fails for DDI. Proper chain IDs, matching those of the solved structures, are important for template-dependent metrics. The following cells read the fasta files of the AF2 runs to switch chain A and chain B if necessary.
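As a sketch of what the switch means for the metrics table, the snippet below swaps paired chain A / chain B columns for the rows marked as flipped. It uses a boolean mask instead of the row-wise loop used later in this section; the toy values and the `swap_chain_columns` helper are hypothetical.

```python
import pandas as pd


def swap_chain_columns(df: pd.DataFrame, pairs, flag: str = "chains_flipped") -> pd.DataFrame:
    """Return a copy of df with each (A, B) column pair swapped where flag is True."""
    out = df.copy()
    # Missing flags (None) are treated as "not flipped"
    mask = out[flag].eq(True)
    for col_a, col_b in pairs:
        # .to_numpy() prevents column alignment from undoing the swap
        out.loc[mask, [col_a, col_b]] = out.loc[mask, [col_b, col_a]].to_numpy()
    return out


# Toy rows; the numbers are invented
toy = pd.DataFrame(
    {
        "chainA_length": [113, 60, 5],
        "chainB_length": [189, 113, 312],
        "chains_flipped": [True, False, None],
    }
)
fixed = swap_chain_columns(toy, [("chainA_length", "chainB_length")])
print(fixed)
```

Only the first row is swapped; rows with False or a missing flag stay untouched, and the input table is not modified.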

In [None]:
# Undo the chain switching if necessary
# This function is based on a function by Joelle Strom in make_json_files.py (last updated: 24.01.2025) to detect runs where the chain IDs have been switched


def have_chains_been_switched(prediction_name: str, benchmark_set: str):
    """
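        Modified function from make_json_files.py to detect switched chains.

        Returns True if the first chain in the AF2 fasta file is at least as long as the
        second one (the chains are then treated as switched in the AF3 run), False if it is
        shorter, and None if the fasta file is missing or does not contain exactly two chains.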
    """
    path = af2_folders[benchmark_set] / (af2_folders[benchmark_set].name + "_" + prediction_name + ".fasta")
    if not path.exists():
        print(f"Can't find {path.name} in {path.parent.parent.name}/{path.parent.name}", end="")
        return None
    chains = {}
    with open(path, "r") as f:
        fasta_contents = f.readlines()
    i = 0
    for line in fasta_contents:
        if re.search(">",line):
            new_chain = True
            i+=1
        else:
            new_chain = False
        if new_chain:
            id = str(i)
            sequence = []
        else:
            sequence.append(line.strip("\n"))
            sequence_str = "".join(sequence)
            chains[i] = sequence_str
    if not len(chains) == 2:
        print(f"{prediction_name} has an invalid number of chains ({len(chains)})", end="")
        return None
    if (l1 := len(list(chains.values())[0])) < (l2 := len(list(chains.values())[1])):
        return False
    elif l1 == l2:
        print("(same length) ", end="")
    return True

# Generate column in data frame if chains have been switched
dataAF["chains_flipped"] = None
for i, row in dataAF.iterrows():
    prediction_name, benchmark_set = row["prediction_name"], row["benchmark_set"]
    print(f"{f'{prediction_name} ({benchmark_set}) -> ':<80}", end="")
    chains_switched = have_chains_been_switched(prediction_name, benchmark_set)
    dataAF.at[i, "chains_flipped"] = chains_switched
    print(chains_switched)

PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133 (known_ddi) -> True
PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133 (known_ddi) -> True
PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133 (known_ddi) -> True
PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133 (known_ddi) -> True
PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133 (known_ddi) -> True
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120 (known_ddi) -> True
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120 (known_ddi) -> True
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120 (known_ddi) -> True
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120 (known_ddi) -> True
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120 (known_ddi) -> True
PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186 (known_ddi) -> True
PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186 (known_ddi) -> True
PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186 (known_ddi) -> True
PF00059_PF00041_1TDQ_B_re

In [None]:
# Flip the metrics if necessary
for i, row in dataAF.iterrows():
    if not row["chains_flipped"]:
        continue
    dataAF.at[i, "chainA_length"], dataAF.at[i, "chainB_length"] = row["chainB_length"], row["chainA_length"]
    dataAF.at[i, "chainA_intf_avg_plddt"], dataAF.at[i, "chainB_intf_avg_plddt"] = row["chainB_intf_avg_plddt"], row["chainA_intf_avg_plddt"]
    dataAF.at[i, "num_chainA_intf_res"], dataAF.at[i, "num_chainB_intf_res"] = row["num_chainB_intf_res"], row["num_chainA_intf_res"]
    if row["model_id"] == "ranked_0":
        print(f"Modified {row['prediction_name']}")

Modified PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133
Modified PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133
Modified PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133
Modified PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133
Modified PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133
Modified PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120
Modified PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120
Modified PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120
Modified PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120
Modified PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120
Modified PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186
Modified PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186
Modified PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186
Modified PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186
Modified PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186
Modified PF00089_PF00095_1FLE_E_resi16_r

### 4 Converting AF3 structure files (.cif) to pdb files
This section loads the model.cif files listed in the dataAF table, converts them to .pdb files and exports them to the destination path. Existing structures are skipped if the setting _export_skip_existing_structures_ is enabled. The chains are flipped as described in section 3a.
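The export layout and the skip logic can be sketched without PyMOL. The `destination_for` and `should_skip` helpers and the example names are hypothetical; the folder scheme (DDI or DMI, then benchmark set, then prediction name, then `<model_id>.pdb`) mirrors the conversion cell of this section.

```python
from pathlib import Path
import tempfile


def destination_for(export_root: Path, benchmark_set: str,
                    prediction_name: str, model_id: str) -> Path:
    """Build <root>/<DDI|DMI>/<benchmark_set>/<prediction_name>/<model_id>.pdb."""
    category = "DDI" if "ddi" in benchmark_set.lower() else "DMI"
    return export_root / category / benchmark_set / prediction_name / f"{model_id}.pdb"


def should_skip(target: Path, skip_existing: bool) -> bool:
    """Skip only if the file already exists and skipping is enabled."""
    return skip_existing and target.exists()


with tempfile.TemporaryDirectory() as tmp:
    target = destination_for(Path(tmp), "known_ddi", "EXAMPLE_PAIR", "ranked_0")
    target.parent.mkdir(parents=True, exist_ok=True)
    skip_before = should_skip(target, skip_existing=True)   # nothing exported yet
    target.write_text("placeholder")                        # stands in for the exported model
    skip_after = should_skip(target, skip_existing=True)
    skip_disabled = should_skip(target, skip_existing=False)
    relative = target.relative_to(tmp).as_posix()

print(relative, skip_before, skip_after, skip_disabled)
```

With skipping disabled, an existing file is simply overwritten, which matches the default `export_skip_existing_structures = False` in the settings cell.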

In [21]:
# Converting .cif to .pdb files 
try:
    dataAF
except NameError:
    raise Exception("Please first run the cells to get the dataAF frame")

# If this property is not set, pymol will ignore the alter commands on the ID when exporting
pymol.cmd.set("pdb_retain_ids", False)
# Ignore segment identifiers so that only the chain IDs have to be handled
pymol.cmd.set("ignore_pdb_segi", True)

if not export_destination.exists() or not export_destination.is_dir():
    raise Exception("Your destination path does not exist")

for index, row in dataAF.iterrows():
    prediction_file = af3_runs_folder / Path(row["model_path"])
    chains_flipped = row["chains_flipped"]
    if not prediction_file.exists():
        print(f"The model file for {row['prediction_name']} does not exist at the expected location ({prediction_file.resolve()})")
        continue

    structure_folder_dest: Path = (export_destination / ("DDI" if "ddi" in str(row['benchmark_set']).lower() else "DMI") / row["benchmark_set"] / row["prediction_name"])
    structure_folder_dest.mkdir(parents=True, exist_ok=True)

    if (structure_file_dest := structure_folder_dest / (str(row["model_id"]) + ".pdb")).exists() and export_skip_existing_structures:
        # Single quotes inside the f-string keep this valid on Python versions before 3.12
        print(f"{row['prediction_name']}/{structure_file_dest.name} already processed. Skip")
        continue
    else:
        print(f"{row['prediction_name']}/{structure_file_dest.name} ->", "Flipping chains" if chains_flipped else "")

    pymol.cmd.load(prediction_file, prediction_file.stem)

    if chains_flipped: # Reorder chains
        pymol.cmd.alter(selection="chain A", expression="chain = 'C'")
        pymol.cmd.alter(selection="chain B", expression="chain = 'A'")
        pymol.cmd.alter(selection="chain C", expression="chain = 'B'")
        pymol.cmd.alter(selection="segi A", expression="segi = 'C'")
        pymol.cmd.alter(selection="segi B", expression="segi = 'A'")
        pymol.cmd.alter(selection="segi C", expression="segi = 'B'")
        pymol.cmd.sort()
        pymol.cmd.alter(selection="chain A", expression=f"ID = (int(ID) - {pymol.cmd.count_atoms('chain B')})")
        pymol.cmd.alter(selection="chain B", expression=f"ID = (int(ID) + {pymol.cmd.count_atoms('chain A')})")
        pymol.cmd.sort()

    pymol.cmd.save(structure_file_dest)
    for o in pymol.cmd.get_object_list():
        pymol.cmd.delete(o)


PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133/ranked_0.pdb -> Flipping chains
PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133/ranked_1.pdb -> Flipping chains
PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133/ranked_2.pdb -> Flipping chains
PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133/ranked_3.pdb -> Flipping chains
PF00009_PF01873_2D74_A_resi12_resi200.B_resi21_resi133/ranked_4.pdb -> Flipping chains
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120/ranked_0.pdb -> Flipping chains
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120/ranked_1.pdb -> Flipping chains
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120/ranked_2.pdb -> Flipping chains
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120/ranked_3.pdb -> Flipping chains
PF00026_PF06394_1F34_A_resi13_resi326.B_resi62_resi120/ranked_4.pdb -> Flipping chains
PF00059_PF00041_1TDQ_B_resi10_resi125.A_resi85_resi186/ranked_0.pdb -> Flipping chains
PF00059_PF00041_1TDQ_B_resi10_resi125.A_res

In [None]:
# Helper cell: If pymol crashes, use this cell to reset pymol
for o in pymol.cmd.get_object_list():
    pymol.cmd.delete(o)

### 5 Reordering of columns

In [44]:
dataAF.columns

Index(['model_preset', 'benchmark_set', 'prediction_name', 'model_id',
       'ranking_score', 'chainA_length', 'chainB_length',
       'fraction_disordered', 'has_clash', 'iptm', 'ptm',
       'chainA_intf_avg_plddt', 'chainB_intf_avg_plddt', 'intf_avg_plddt',
       'num_chainA_intf_res', 'num_chainB_intf_res', 'num_res_res_contact',
       'num_atom_atom_contact', 'iPAE', 'pDockQ', 'model_path', 'PDB_id',
       'ELM_instance', 'ddi_pfam_id', 'PDB_id_random_paired',
       'ELM_instance_random_paired', 'ddi_pfam_id_random_paired',
       'sequence_initial', 'sequence_mutated', 'chainA_id', 'chainB_id',
       'chainA_start', 'chainA_end', 'chainB_start', 'chainB_end',
       'chains_flipped', 'num_mutations'],
      dtype='object')

In [45]:
c = list(dataAF.columns)
c.remove("model_path")
c.remove("chains_flipped")
c.append("chains_flipped")
c.append("model_path")

c.remove("num_mutations")
c.insert(4,"num_mutations")

dataAF = dataAF[c].copy()
dataAF

Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,num_mutations,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,...,sequence_initial,sequence_mutated,chainA_id,chainB_id,chainA_start,chainA_end,chainB_start,chainB_end,chains_flipped,model_path
0,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_0,,0.28,189,113,0.04,0.0,...,,,A,B,12.0,200.0,21.0,133.0,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
1,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_1,,0.25,189,113,0.04,0.0,...,,,A,B,12.0,200.0,21.0,133.0,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
2,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_2,,0.22,189,113,0.04,0.0,...,,,A,B,12.0,200.0,21.0,133.0,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
3,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_3,,0.19,189,113,0.04,0.0,...,,,A,B,12.0,200.0,21.0,133.0,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
4,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_4,,0.17,189,113,0.04,0.0,...,,,A,B,12.0,200.0,21.0,133.0,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3175,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_0,,0.93,312,4,0.02,0.0,...,,,A,B,59.0,361.0,140.0,143.0,True,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3176,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_1,,0.92,312,4,0.01,0.0,...,,,A,B,59.0,361.0,140.0,143.0,True,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3177,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_2,,0.92,312,4,0.02,0.0,...,,,A,B,59.0,361.0,140.0,143.0,True,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3178,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_3,,0.91,312,4,0.01,0.0,...,,,A,B,59.0,361.0,140.0,143.0,True,AlphaFold_benchmark_DMI\random_minimal\gloomy_...


### 6 Save metric file

In [4]:
# Sorting the file
dataAF["_benchmark_id"] = dataAF["benchmark_set"].replace({"known_minimal": "1", "random_minimal": "2", "mutations": "3", "known_ddi": "4", "random_ddi": "5"}).astype(int)
dataAF.sort_values(["_benchmark_id", "prediction_name", "model_id"], inplace=True)
dataAF.drop(columns=["_benchmark_id"], inplace=True)
dataAF.reset_index(drop=True, inplace=True)

dataAF

Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,iptm,...,chainB_intf_avg_plddt,intf_avg_plddt,num_chainA_intf_res,num_chainB_intf_res,num_res_res_contact,num_atom_atom_contact,iPAE,pDockQ,chains_flipped,model_path
0,alphafold3,known_minimal,DEG_APCC_KENBOX_2_4GGD,ranked_0,0.97,312,5,0.02,0.0,0.96,...,88.25,94.54,15,4,25,252,1.85,0.20,True,AlphaFold_benchmark_DMI\known_minimal\sharp_sh...
1,alphafold3,known_minimal,DEG_APCC_KENBOX_2_4GGD,ranked_1,0.97,312,5,0.02,0.0,0.96,...,88.12,94.20,15,5,26,263,1.85,0.20,True,AlphaFold_benchmark_DMI\known_minimal\sharp_sh...
2,alphafold3,known_minimal,DEG_APCC_KENBOX_2_4GGD,ranked_2,0.96,312,5,0.02,0.0,0.96,...,86.06,93.49,14,5,27,280,2.15,0.20,True,AlphaFold_benchmark_DMI\known_minimal\sharp_sh...
3,alphafold3,known_minimal,DEG_APCC_KENBOX_2_4GGD,ranked_3,0.96,312,5,0.02,0.0,0.95,...,83.80,92.56,15,5,26,261,1.90,0.15,True,AlphaFold_benchmark_DMI\known_minimal\sharp_sh...
4,alphafold3,known_minimal,DEG_APCC_KENBOX_2_4GGD,ranked_4,0.96,312,5,0.02,0.0,0.95,...,84.93,93.03,15,5,27,271,1.95,0.19,True,AlphaFold_benchmark_DMI\known_minimal\sharp_sh...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3175,alphafold3,random_ddi,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_0,0.36,60,113,0.22,0.0,0.19,...,64.94,59.98,12,13,24,130,12.30,0.04,False,AlphaFold_benchmark_DDI\random_ddi\angry_sange...
3176,alphafold3,random_ddi,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_1,0.23,60,113,0.08,0.0,0.12,...,72.04,64.46,9,11,18,120,18.69,0.04,False,AlphaFold_benchmark_DDI\random_ddi\angry_sange...
3177,alphafold3,random_ddi,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_2,0.22,60,113,0.14,0.0,0.07,...,51.87,50.40,8,11,20,141,22.10,0.03,False,AlphaFold_benchmark_DDI\random_ddi\angry_sange...
3178,alphafold3,random_ddi,D1PF18773_PF00071_2X19.D2PF00009_PF01873_2D74,ranked_3,0.21,60,113,0.07,0.0,0.10,...,61.57,56.68,19,17,39,290,21.80,0.06,False,AlphaFold_benchmark_DDI\random_ddi\angry_sange...
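The custom sort order above can also be expressed with an ordered categorical instead of mapping the benchmark names to numbers. The following is only a minimal sketch of that alternative: the helper `sort_by_benchmark`, the constant `BENCHMARK_ORDER` and the toy table are hypothetical names and made-up data used for illustration, and the order of the benchmark sets is copied from the mapping in the sorting cell.

```python
import pandas as pd

# Order of the benchmark sets, copied from the mapping used in the sorting cell
BENCHMARK_ORDER = ["known_minimal", "random_minimal", "mutations", "known_ddi", "random_ddi"]


def sort_by_benchmark(df: pd.DataFrame) -> pd.DataFrame:
    """Return a sorted copy: benchmark set (custom order), prediction name, model id."""
    # Fail early on names that are missing from the custom order
    unknown = set(df["benchmark_set"]) - set(BENCHMARK_ORDER)
    if unknown:
        raise ValueError(f"Unknown benchmark sets: {sorted(unknown)}")

    # An ordered categorical sorts by the position in BENCHMARK_ORDER, not alphabetically
    order = pd.Categorical(df["benchmark_set"], categories=BENCHMARK_ORDER, ordered=True)
    return (
        df.assign(_benchmark_order=order)
        .sort_values(["_benchmark_order", "prediction_name", "model_id"])
        .drop(columns="_benchmark_order")
        .reset_index(drop=True)
    )


# Toy data, made up for illustration (not taken from the real metric file)
toy = pd.DataFrame({
    "benchmark_set": ["random_ddi", "known_minimal", "known_ddi", "known_minimal"],
    "prediction_name": ["X", "B", "Y", "A"],
    "model_id": ["ranked_0", "ranked_0", "ranked_0", "ranked_1"],
})
sorted_toy = sort_by_benchmark(toy)
print(sorted_toy)
```

Compared with the numeric helper column, this variant raises a readable error for an unexpected benchmark name and leaves the input DataFrame unchanged.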


In [6]:
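# Export metrics files
# is_dir() is also False for a missing path, so one check covers both cases
if not export_destination.is_dir():
    raise NotADirectoryError(f"Export destination is not an existing directory: {export_destination}")

# Tab-separated file for further processing, Excel file for manual inspection
# (writing .xlsx requires an Excel writer engine such as openpyxl)
dataAF.to_csv(export_destination / "AF3_output.tsv", sep="\t", index=False)
dataAF.to_excel(export_destination / "AF3_output.xlsx", sheet_name="AF3", index=False)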
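Since the TSV file is reloaded later to recalculate columns, it can be worth checking that the export survives a round trip. The following is a minimal sketch under assumptions: `export_and_verify` is a hypothetical helper that is not part of this notebook, the toy table is made up, and the check writes into a temporary folder instead of the real export destination.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_and_verify(df: pd.DataFrame, destination, stem: str = "AF3_output") -> Path:
    """Write df as TSV into destination, reload it and compare it with the original."""
    destination = Path(destination)
    if not destination.is_dir():
        raise NotADirectoryError(f"Export destination is not an existing directory: {destination}")

    tsv_path = destination / f"{stem}.tsv"
    df.to_csv(tsv_path, sep="\t", index=False)

    # Reload and compare values; dtypes are not compared because read_csv re-infers them
    reloaded = pd.read_csv(tsv_path, sep="\t")
    pd.testing.assert_frame_equal(df.reset_index(drop=True), reloaded, check_dtype=False)
    return tsv_path


# Toy data, made up for illustration (not taken from the real metric file)
toy = pd.DataFrame({
    "prediction_name": ["A", "B"],
    "ranking_score": [0.97, 0.36],
    "chains_flipped": [True, False],
    "model_path": ["AlphaFold_benchmark_DMI\\known_minimal", "AlphaFold_benchmark_DDI\\random_ddi"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    written = export_and_verify(toy, tmp_dir)
    print(written.name)
```

One caveat of this sketch: empty strings are read back as NaN by `pd.read_csv`, so a table that contains empty strings would fail the comparison even though the file itself is fine.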

Need to reload the metric file to recalculate some columns? Run the following cell (uncomment its code first if it is commented out)

In [3]:
# Load metrics file
dataAF = pd.read_csv(export_destination / "AF3_output.tsv", sep="\t")
dataAF

Unnamed: 0,model_preset,benchmark_set,prediction_name,model_id,ranking_score,chainA_length,chainB_length,fraction_disordered,has_clash,iptm,...,chainB_intf_avg_plddt,intf_avg_plddt,num_chainA_intf_res,num_chainB_intf_res,num_res_res_contact,num_atom_atom_contact,iPAE,pDockQ,chains_flipped,model_path
0,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_0,0.28,189,113,0.04,0.0,0.20,...,58.43,60.44,13,12,26,184,18.66,0.04,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
1,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_1,0.25,189,113,0.04,0.0,0.16,...,60.24,59.02,12,13,26,204,21.40,0.05,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
2,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_2,0.22,189,113,0.04,0.0,0.13,...,57.70,57.76,14,13,28,202,23.16,0.05,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
3,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_3,0.19,189,113,0.04,0.0,0.10,...,46.30,52.58,17,14,38,286,25.10,0.04,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
4,alphafold3,known_ddi,PF00009_PF01873_2D74_A_resi12_resi200.B_resi21...,ranked_4,0.17,189,113,0.04,0.0,0.07,...,32.83,40.71,13,11,21,133,27.70,0.02,True,AlphaFold_benchmark_DDI\known_ddi\suspicious_c...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3175,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_0,0.93,312,4,0.02,0.0,0.91,...,84.04,94.45,17,4,26,239,3.50,0.31,True,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3176,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_1,0.92,312,4,0.01,0.0,0.90,...,83.70,95.04,17,4,26,233,3.60,0.31,True,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3177,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_2,0.92,312,4,0.02,0.0,0.90,...,83.24,94.40,18,4,28,246,3.65,0.30,True,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
3178,alphafold3,random_minimal,MTRG_PTS1_2C0L.DLIG_WD40_WDR5_WIN_2_4CY3,ranked_3,0.91,312,4,0.01,0.0,0.90,...,83.07,94.86,17,4,26,232,3.60,0.30,True,AlphaFold_benchmark_DMI\random_minimal\gloomy_...
