<a href="https://colab.research.google.com/github/eoinleen/Protein-design-random/blob/main/pLDDT-analyser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
==============================================================================
PDB PLDDT SCORE CALCULATOR FOR ALPHAFOLD MODELS
==============================================================================

Author: Claude AI Assistant
Date: March 17, 2025
Version: 1.1

DESCRIPTION:
------------
This script calculates the average pLDDT (predicted Local Distance Difference Test)
scores for a specified chain in AlphaFold-generated PDB files. The pLDDT score is
a confidence metric ranging from 0-100, where higher values indicate higher
prediction confidence.

Score interpretation (pLDDT is a per-residue, local confidence measure):
- Very high (above 90): High accuracy expected for both backbone and side chains
- High (70-90): Backbone generally well modelled; side chains may be misplaced
- Medium (50-70): Low confidence; treat these regions with caution
- Low (below 50): Should not be interpreted; often disordered or unstructured regions

FEATURES:
---------
- Processes all PDB files in a specified Google Drive directory
- Allows selection of which chain to analyze (default is chain 'A'; the run block at the bottom sets 'B')
- Calculates per-residue pLDDT scores (avoiding duplicate atom values)
- Provides comprehensive statistics including min/max/avg scores
- Calculates confidence level distributions
- Outputs results to a CSV file for further analysis
- Displays a preview of results directly in the notebook

USAGE:
------
1. Run this script in Google Colab
2. Ensure your Google Drive is mounted
3. Set the 'pdb_directory' to point to your directory of PDB files
4. Set the 'target_chain' parameter to the chain you want to analyze (default 'A')
5. Results will be saved as 'plddt_scores_chain<ID>.csv' (e.g. 'plddt_scores_chainB.csv') in the same directory

REQUIREMENTS:
------------
- Google Colab environment
- Access to Google Drive containing PDB files
- Python modules: os and collections (standard library), pandas, google.colab, IPython

OUTPUTS:
--------
CSV file with columns:
- file: PDB filename
- avg_plddt: Average pLDDT score for the specified chain
- num_residues: Number of residues in the chain
- min_plddt: Minimum pLDDT score
- max_plddt: Maximum pLDDT score
- high_conf_pct: Percentage of residues with scores above 70 and up to 90
- very_high_conf_pct: Percentage of residues with scores above 90
- error: Any errors encountered during processing

==============================================================================
"""

import os
import pandas as pd
from collections import defaultdict
from google.colab import drive
from IPython.display import display

# Mount Google Drive if not already mounted
def ensure_drive_mounted():
    """Check if Google Drive is mounted and mount it if necessary."""
    if not os.path.exists('/content/drive'):
        print("Mounting Google Drive...")
        drive.mount('/content/drive')
        print("Google Drive mounted.")
    else:
        print("Google Drive is already mounted.")

# Function to calculate pLDDT scores for a PDB file
def calculate_plddt(pdb_file, target_chain='A'):
    """
    Calculate average pLDDT score for specified chain in a PDB file.

    Args:
        pdb_file (str): Path to the PDB file
        target_chain (str): Chain identifier to analyze (default: 'A')

    Returns:
        dict: Results dictionary with pLDDT statistics
    """

    # Store B-factors (pLDDT values) by residue
    # Using defaultdict(list) to automatically create a new list for each new residue
    residue_bfactors = defaultdict(list)

    try:
        # Open and read the PDB file
        with open(pdb_file, 'r') as f:
            for line in f:
                # Process only ATOM records
                if line.startswith('ATOM'):
                    # Extract chain ID (column 22 in PDB format)
                    chain = line[21:22].strip()
                    if chain == target_chain:
                        # Extract residue number plus insertion code (columns 23-27)
                        # so that residues such as 52 and 52A are not merged
                        res_num = line[22:27].strip()
                        # Extract B-factor/pLDDT (columns 61-66)
                        b_factor = float(line[60:66].strip())
                        # Create a unique key for this residue
                        residue_key = f"{chain}:{res_num}"
                        # Add the B-factor to this residue's list
                        residue_bfactors[residue_key].append(b_factor)

        # If no atoms found for the target chain
        if not residue_bfactors:
            return {
                'file': os.path.basename(pdb_file),
                'avg_plddt': None,
                'num_residues': 0,
                'min_plddt': None,
                'max_plddt': None,
                'high_conf_pct': None,
                'very_high_conf_pct': None,
                'error': f'No chain {target_chain} found'
            }

        # Calculate average pLDDT per residue
        # AlphaFold2 writes the same pLDDT on every atom of a residue; averaging
        # per residue also copes with files that carry per-atom values
        residue_avg = {}
        for res_key, values in residue_bfactors.items():
            residue_avg[res_key] = sum(values) / len(values)

        # Overall average is the average of all residue averages
        residue_values = list(residue_avg.values())
        overall_avg = sum(residue_values) / len(residue_values)

        # Calculate confidence distributions
        num_residues = len(residue_values)
        very_high_conf = sum(1 for v in residue_values if v > 90)
        high_conf = sum(1 for v in residue_values if 70 < v <= 90)

        # Return a dictionary with all statistics
        return {
            'file': os.path.basename(pdb_file),
            'avg_plddt': round(overall_avg, 2),
            'num_residues': num_residues,
            'min_plddt': round(min(residue_values), 2),
            'max_plddt': round(max(residue_values), 2),
            'high_conf_pct': round(high_conf / num_residues * 100, 1),
            'very_high_conf_pct': round(very_high_conf / num_residues * 100, 1),
            'error': None
        }

    except Exception as e:
        # Handle any exceptions that occur during processing
        return {
            'file': os.path.basename(pdb_file),
            'avg_plddt': None,
            'num_residues': 0,
            'min_plddt': None,
            'max_plddt': None,
            'high_conf_pct': None,
            'very_high_conf_pct': None,
            'error': str(e)
        }

# Function to process all PDB files in directory
def process_directory(directory_path, target_chain='A'):
    """
    Process all PDB files in the given directory.

    Args:
        directory_path (str): Path to directory containing PDB files
        target_chain (str): Chain identifier to analyze

    Returns:
        list: List of result dictionaries for each PDB file
    """
    results = []

    # Check if directory exists
    if not os.path.isdir(directory_path):
        raise ValueError(f"Directory not found: {directory_path}")

    # Get all PDB files in the directory
    pdb_files = [f for f in os.listdir(directory_path) if f.endswith('.pdb')]

    if not pdb_files:
        print("No PDB files found in the directory.")
        return results

    print(f"Found {len(pdb_files)} PDB files to process.")

    # Process each PDB file with manual progress tracking
    total = len(pdb_files)
    for i, filename in enumerate(pdb_files):
        pdb_path = os.path.join(directory_path, filename)
        # Calculate pLDDT for this file, analyzing the specified chain
        result = calculate_plddt(pdb_path, target_chain)
        results.append(result)
        # Print progress updates
        if (i + 1) % 5 == 0 or (i + 1) == total:
            print(f"Progress: {i+1}/{total} files processed ({((i+1)/total*100):.1f}%)")

    return results

# Main function to run everything
def main(drive_pdb_path, target_chain='A'):
    """
    Main function to execute the entire workflow.

    Args:
        drive_pdb_path (str): Path to directory containing PDB files
        target_chain (str): Chain identifier to analyze
    """
    # Ensure drive is mounted
    ensure_drive_mounted()

    # Check if the path exists
    if not os.path.exists(drive_pdb_path):
        print(f"Error: The path {drive_pdb_path} does not exist.")
        return

    # Process PDB files
    print(f"Processing PDB files in {drive_pdb_path}")
    print(f"Analyzing chain: {target_chain}")
    results = process_directory(drive_pdb_path, target_chain)

    # Save results to CSV
    if results:
        output_path = os.path.join(drive_pdb_path, f'plddt_scores_chain{target_chain}.csv')
        df = pd.DataFrame(results)
        df.to_csv(output_path, index=False)
        print(f"\nResults saved to {output_path}")

        # Print summary
        success_count = sum(1 for r in results if r['error'] is None)
        print(f"Successfully processed {success_count} of {len(results)} files.")

        # Display results
        print("\nResults Preview:")
        display(df)

        # Print average pLDDT across all successful files
        successful_scores = [r['avg_plddt'] for r in results if r['error'] is None and r['avg_plddt'] is not None]
        if successful_scores:
            overall_avg = sum(successful_scores) / len(successful_scores)
            print(f"\nOverall average pLDDT across all files (chain {target_chain}): {overall_avg:.2f}")
    else:
        print("No results generated. Make sure you have PDB files in the directory.")

# Run the script
if __name__ == "__main__":
    print("=== PDB pLDDT Calculator for AlphaFold Models ===\n")
    print("This script calculates average pLDDT confidence scores for a specified chain")
    print("in PDB files and generates a CSV report with detailed statistics.\n")

    # Directory on Google Drive that contains the PDB files
    pdb_directory = '/content/drive/MyDrive/PDB-files/20250305_current_list'

    # Specify which chain to analyze (change this to analyze a different chain)
    target_chain = 'B'

    main(pdb_directory, target_chain)

=== PDB pLDDT Calculator for AlphaFold Models ===

This script calculates average pLDDT confidence scores for a specified chain
in PDB files and generates a CSV report with detailed statistics.

Google Drive is already mounted.
Processing PDB files in /content/drive/MyDrive/PDB-files/20250305_current_list
Analyzing chain: B
Found 21 PDB files to process.
Progress: 5/21 files processed (23.8%)
Progress: 10/21 files processed (47.6%)
Progress: 15/21 files processed (71.4%)
Progress: 20/21 files processed (95.2%)
Progress: 21/21 files processed (100.0%)

Results saved to /content/drive/MyDrive/PDB-files/20250305_current_list/plddt_scores_chainB.csv
Successfully processed 21 of 21 files.

Results Preview:


   file                                        avg_plddt  num_residues  min_plddt  max_plddt  high_conf_pct  very_high_conf_pct  error
0  design21_n10.pdb                                89.78            42      78.17      95.88           52.4                47.6
1  3NOBEK_strand_l81_s451607_mpnn5_model2.pdb      88.66            81      57.39      94.21           48.1                49.4
2  3NOBEK_l99_s437645_mpnn19_model2.pdb            88.82            99      42.45      96.93           35.4                59.6
3  3NOBEK_l144_s164418_mpnn3_model1.pdb            93.22           144      59.84      97.74            9.0                89.6
4  3NOBEK_l113_s712433_mpnn3_model1.pdb            93.22           113      67.85      97.67           10.6                88.5
5  3NOBEK_l143_s311379_mpnn1_model1.pdb            90.12           143      55.02      97.03           25.9                70.6
6  3NOBEK_l103_s333072_mpnn3_model2.pdb            93.01           103       68.5      97.79           11.7                87.4
7  3NOBEK_l93_s898214_mpnn4_model1.pdb             90.91            93      55.49      96.74           23.7                73.1
8  3NOBEK_l128_s993298_mpnn1_model2.pdb            93.17           128      80.65      97.99           17.2                82.8
9  3NOBEK_l117_s150447_mpnn3_model2.pdb             91.1           117       68.4      97.01           28.2                70.9



Overall average pLDDT across all files (chain B): 91.47
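
The fixed-column parsing and confidence banding used by the script can be tried without any files on Drive. The following is a minimal sketch on an in-memory fragment; the helpers `atom_line`, `residue_plddt` and `confidence_band`, the band labels, and all sample values are illustrative assumptions and are not part of the script above.

```python
from collections import defaultdict


def atom_line(serial, name, res_name, chain, res_seq, plddt):
    """Build one fixed-column PDB ATOM record (hypothetical helper, dummy coordinates)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


def residue_plddt(pdb_text, target_chain):
    """Return {residue id: mean pLDDT} for one chain, read from the B-factor column."""
    per_residue = defaultdict(list)
    for line in pdb_text.splitlines():
        # Chain ID is column 22; B-factor (pLDDT) occupies columns 61-66.
        if line.startswith("ATOM") and line[21:22] == target_chain:
            res_id = line[22:27].strip()  # residue number plus insertion code
            per_residue[res_id].append(float(line[60:66]))
    return {res: sum(vals) / len(vals) for res, vals in per_residue.items()}


def confidence_band(plddt):
    """Map a pLDDT value to a band, using the same cut-offs as the script (90/70/50)."""
    if plddt > 90:
        return "very_high"
    if plddt > 70:
        return "high"
    if plddt > 50:
        return "medium"
    return "low"


# Made-up fragment: two residues in chain B, one residue in chain A.
sample = "\n".join([
    atom_line(1, "N", "MET", "B", 1, 92.5),
    atom_line(2, "CA", "MET", "B", 1, 92.5),
    atom_line(3, "N", "GLY", "B", 2, 65.0),
    atom_line(4, "N", "ALA", "A", 1, 40.0),
])

scores = residue_plddt(sample, "B")
bands = {res: confidence_band(value) for res, value in scores.items()}
print(scores)
print(bands)
```

Grouping by residue before averaging means a residue with many atoms counts once, so large side chains do not pull the chain average toward their own score.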


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
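
With Drive mounted, the CSV report written by the first cell can be read back for filtering. The sketch below uses only the standard library and three rows copied from the results preview; the function name `rank_designs` and the cut-offs `min_avg` and `min_floor` are hypothetical choices for illustration, not recommended thresholds.

```python
import csv
import io

# Three rows copied from the results preview above. In practice, open the
# plddt_scores_chain<ID>.csv file on Drive instead of using this string.
REPORT = """\
file,avg_plddt,num_residues,min_plddt,max_plddt,high_conf_pct,very_high_conf_pct,error
design21_n10.pdb,89.78,42,78.17,95.88,52.4,47.6,
3NOBEK_l99_s437645_mpnn19_model2.pdb,88.82,99,42.45,96.93,35.4,59.6,
3NOBEK_l128_s993298_mpnn1_model2.pdb,93.17,128,80.65,97.99,17.2,82.8,
"""


def rank_designs(csv_text, min_avg=85.0, min_floor=70.0):
    """Keep designs whose average and worst-residue pLDDT clear the cut-offs.

    Returns (file, avg_plddt, min_plddt) tuples sorted by average, best first.
    """
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["error"]:  # rows for files that failed have a non-empty error field
            continue
        avg = float(row["avg_plddt"])
        floor = float(row["min_plddt"])
        if avg >= min_avg and floor >= min_floor:
            kept.append((row["file"], avg, floor))
    return sorted(kept, key=lambda item: item[1], reverse=True)


ranked = rank_designs(REPORT)
for name, avg, floor in ranked:
    print(f"{name}: avg={avg:.2f}, min={floor:.2f}")
```

Filtering on `min_plddt` as well as the average flags designs that look good overall but contain a poorly predicted stretch, such as the second row here.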
