<a href="https://colab.research.google.com/github/eoinleen/Protein-design-random/blob/main/trb_pdb_matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
#!/usr/bin/env python3
"""
PDB/TRB File Matcher Script for Google Colab
Compares .pdb and .trb files from a file listing and identifies orphan files.

===============================================================================
 PDB/TRB File Matcher for Google Colab and Local Use
===============================================================================
 Author:      Eoin "Scone" Leen
 Created:     July 2025
 Description:
     This script analyzes the output of an 'ls -la' directory listing to
     identify and compare PDB (.pdb) and TRB (.trb) files. It reports:
        - Matched PDB/TRB file pairs
        - Orphan PDB files (no corresponding .trb)
        - Orphan TRB files (no corresponding .pdb)
     It also saves the results into text files for further processing.

 Features:
     ✅ Google Colab integration with file upload
     ✅ Local file support via filename input
     ✅ Clear summary report with matched and orphan files
     ✅ Test mode with built-in sample data
     ✅ Safe file writing for matched and orphan lists

 Usage:
     ▶️ In Google Colab:
         - Run the cell to trigger file upload and analysis.
     💻 Locally:
         - Save your 'ls -la' output to a text file.
         - Run: analyze_with_filename('your_file.txt')

 Output:
     - matched_pairs.txt
     - orphan_pdbs.txt
     - orphan_trbs.txt

 Requirements:
     - Standard Python 3 libraries
     - google.colab module (for Colab use only)

===============================================================================

"""

import re

def parse_file_listing(file_content):
    """
    Parse the output of 'ls -la' command and extract filenames.

    Args:
        file_content (str): Content of the file listing

    Returns:
        tuple: (pdb_files, trb_files) - lists of filenames
    """
    lines = file_content.strip().split('\n')
    pdb_files = []
    trb_files = []

    for line in lines:
        if not line.strip():
            continue

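        # Extract the filename from the 'ls -la' line. Taking the last
        # whitespace-separated token tolerates multi-word owner/group names
        # (e.g. "Domain Users") but not filenames that contain spaces.
        parts = line.split()
        if len(parts) >= 9:  # a full 'ls -la' entry has at least 9 fields
            filename = parts[-1]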

            if filename.endswith('.pdb'):
                pdb_files.append(filename)
            elif filename.endswith('.trb'):
                trb_files.append(filename)

    return sorted(pdb_files), sorted(trb_files)

def get_base_name(filename):
    """
    Extract base name by removing file extension.

    Args:
        filename (str): Full filename with extension

    Returns:
        str: Base name without extension
    """
    return re.sub(r'\.(pdb|trb)$', '', filename)

def analyze_file_pairs(pdb_files, trb_files):
    """
    Analyze PDB and TRB files to find matches and orphans.

    Args:
        pdb_files (list): List of PDB filenames
        trb_files (list): List of TRB filenames

    Returns:
        dict: Analysis results
    """
    # Get base names
    pdb_bases = {get_base_name(f): f for f in pdb_files}
    trb_bases = {get_base_name(f): f for f in trb_files}

    # Find matches and orphans
    all_bases = set(pdb_bases.keys()) | set(trb_bases.keys())

    matched_pairs = []
    orphan_pdbs = []
    orphan_trbs = []

    for base in sorted(all_bases):
        has_pdb = base in pdb_bases
        has_trb = base in trb_bases

        if has_pdb and has_trb:
            matched_pairs.append({
                'base': base,
                'pdb': pdb_bases[base],
                'trb': trb_bases[base]
            })
        elif has_pdb and not has_trb:
            orphan_pdbs.append(pdb_bases[base])
        elif has_trb and not has_pdb:
            orphan_trbs.append(trb_bases[base])

    return {
        'matched_pairs': matched_pairs,
        'orphan_pdbs': orphan_pdbs,
        'orphan_trbs': orphan_trbs,
        'total_pdb': len(pdb_files),
        'total_trb': len(trb_files),
        'total_matched': len(matched_pairs)
    }
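
# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the original script): the loop in
# analyze_file_pairs() amounts to set intersection and difference on base
# names. `pair_by_base_name` is a hypothetical helper used only to show that
# idea; it returns base names rather than the full result dictionary.
# ---------------------------------------------------------------------------
import os


def pair_by_base_name(pdb_files, trb_files):
    """Return (matched, pdb_only, trb_only) as sorted lists of base names."""
    # os.path.splitext removes only the final extension, so dots inside a
    # name such as 'noise0.8' are preserved.
    pdb_bases = {os.path.splitext(name)[0] for name in pdb_files}
    trb_bases = {os.path.splitext(name)[0] for name in trb_files}
    matched = sorted(pdb_bases & trb_bases)    # present as both .pdb and .trb
    pdb_only = sorted(pdb_bases - trb_bases)   # orphan PDB base names
    trb_only = sorted(trb_bases - pdb_bases)   # orphan TRB base names
    return matched, pdb_only, trb_only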

def print_analysis_report(analysis):
    """
    Print a detailed analysis report.

    Args:
        analysis (dict): Analysis results from analyze_file_pairs
    """
    print("=" * 60)
    print("PDB/TRB FILE ANALYSIS REPORT")
    print("=" * 60)

    print(f"\n📊 SUMMARY:")
    print(f"  Total PDB files: {analysis['total_pdb']}")
    print(f"  Total TRB files: {analysis['total_trb']}")
    print(f"  Matched pairs: {analysis['total_matched']}")
    print(f"  Orphan PDB files: {len(analysis['orphan_pdbs'])}")
    print(f"  Orphan TRB files: {len(analysis['orphan_trbs'])}")

    print(f"\n✅ MATCHED PAIRS ({len(analysis['matched_pairs'])}):")
    if analysis['matched_pairs']:
        for i, pair in enumerate(analysis['matched_pairs'][:10], 1):  # Show first 10
            print(f"  {i:2d}. {pair['base']}")
        if len(analysis['matched_pairs']) > 10:
            print(f"     ... and {len(analysis['matched_pairs']) - 10} more")
    else:
        print("  No matched pairs found!")

    print(f"\n🔴 ORPHAN PDB FILES ({len(analysis['orphan_pdbs'])}):")
    if analysis['orphan_pdbs']:
        for i, pdb in enumerate(analysis['orphan_pdbs'][:10], 1):  # Show first 10
            print(f"  {i:2d}. {pdb}")
        if len(analysis['orphan_pdbs']) > 10:
            print(f"     ... and {len(analysis['orphan_pdbs']) - 10} more")
    else:
        print("  No orphan PDB files!")

    print(f"\n🔴 ORPHAN TRB FILES ({len(analysis['orphan_trbs'])}):")
    if analysis['orphan_trbs']:
        for i, trb in enumerate(analysis['orphan_trbs'][:10], 1):  # Show first 10
            print(f"  {i:2d}. {trb}")
        if len(analysis['orphan_trbs']) > 10:
            print(f"     ... and {len(analysis['orphan_trbs']) - 10} more")
    else:
        print("  No orphan TRB files!")

def save_lists_to_files(analysis):
    """
    Save the different file lists to text files.

    Args:
        analysis (dict): Analysis results from analyze_file_pairs
    """
    # Save matched pairs
    with open('matched_pairs.txt', 'w') as f:
        f.write("# Matched PDB/TRB pairs (base names)\n")
        for pair in analysis['matched_pairs']:
            f.write(f"{pair['base']}\n")

    # Save orphan PDBs
    with open('orphan_pdbs.txt', 'w') as f:
        f.write("# PDB files without corresponding TRB files\n")
        for pdb in analysis['orphan_pdbs']:
            f.write(f"{pdb}\n")

    # Save orphan TRBs
    with open('orphan_trbs.txt', 'w') as f:
        f.write("# TRB files without corresponding PDB files\n")
        for trb in analysis['orphan_trbs']:
            f.write(f"{trb}\n")

    print(f"\n💾 FILES SAVED:")
    print(f"  matched_pairs.txt - {len(analysis['matched_pairs'])} matched base names")
    print(f"  orphan_pdbs.txt - {len(analysis['orphan_pdbs'])} orphan PDB files")
    print(f"  orphan_trbs.txt - {len(analysis['orphan_trbs'])} orphan TRB files")

# Google Colab file upload function
def upload_and_analyze():
    """
    Main function for Google Colab with direct file upload.
    """
    from google.colab import files

    print("PDB/TRB File Matcher for Google Colab")
    print("=" * 40)

    print("\n📤 Please upload your file listing (ls -la output):")
    uploaded = files.upload()

    if not uploaded:
        print("❌ No file uploaded!")
        return

    # Process the uploaded file
    filename = list(uploaded.keys())[0]
    file_content = uploaded[filename].decode('utf-8')

    print(f"✅ Successfully uploaded and read file: {filename}")
    print(f"📄 File size: {len(file_content)} characters")

    try:
        # Parse the file listing
        pdb_files, trb_files = parse_file_listing(file_content)

        # Analyze the files
        analysis = analyze_file_pairs(pdb_files, trb_files)

        # Print the report
        print_analysis_report(analysis)

        # Save results to files
        save_lists_to_files(analysis)

        print(f"\n🎉 Analysis complete!")
        print(f"📥 You can download the generated files from the Colab file browser.")

    except Exception as e:
        print(f"❌ Error processing file: {str(e)}")

# Alternative function for manual filename input (if upload doesn't work)
def analyze_with_filename(filename):
    """
    Backup function if file upload doesn't work.

    Args:
        filename (str): Name of the uploaded file
    """
    try:
        with open(filename, 'r') as f:
            file_content = f.read()

        print(f"✅ Successfully read file: {filename}")

        # Parse the file listing
        pdb_files, trb_files = parse_file_listing(file_content)

        # Analyze the files
        analysis = analyze_file_pairs(pdb_files, trb_files)

        # Print the report
        print_analysis_report(analysis)

        # Save results to files
        save_lists_to_files(analysis)

    except FileNotFoundError:
        print(f"❌ Error: Could not find file '{filename}'")
        print("   Make sure the path is correct (in Colab, upload the file first).")
    except Exception as e:
        print(f"❌ Error: {str(e)}")
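
# ---------------------------------------------------------------------------
# Illustrative sketch (not part of the original script): when the .pdb/.trb
# files sit on the same machine as this script, the 'ls -la' step could be
# skipped by scanning the directory directly. The helper name
# `list_pdb_trb_in_directory` and its `directory` parameter are hypothetical.
# It returns the same (pdb_files, trb_files) tuple shape as
# parse_file_listing(), so the result could be passed to analyze_file_pairs().
# ---------------------------------------------------------------------------
from pathlib import Path


def list_pdb_trb_in_directory(directory="."):
    """Return sorted (pdb_files, trb_files) filename lists for one directory."""
    # Only regular files directly inside `directory` are considered;
    # subdirectories are not searched.
    entries = [p for p in Path(directory).iterdir() if p.is_file()]
    pdb_files = sorted(p.name for p in entries if p.suffix == ".pdb")
    trb_files = sorted(p.name for p in entries if p.suffix == ".trb")
    return pdb_files, trb_files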

# For Google Colab - run this to start the file upload
def main():
    """
    Main function - automatically triggers file upload in Colab
    """
    try:
        # Try to import google.colab to check if we're in Colab
        from google.colab import files
        upload_and_analyze()
    except ImportError:
        print("⚠️  Not running in Google Colab!")
        print("📝 Please use analyze_with_filename('your_file.txt') instead")
        print("   or run the test example below.")

# For Google Colab - run this to start!
main()

# Alternative usage examples:
"""
🚀 GOOGLE COLAB USAGE:
Just run the cell! The script will automatically prompt you to upload a file.

📁 MANUAL FILE USAGE (if upload fails):
1. Upload your file to Colab manually
2. Use: analyze_with_filename('your_uploaded_file.txt')

🧪 TEST WITH SAMPLE DATA:
Run the code below to see how it works with sample data.
"""
if __name__ == "__main__":
    # Example usage with sample data
    sample_data = """
-rw-r--r--. 1 fbselee Domain Users 130784 Jul  8 12:59 dir1_noise0.8_20250705_114135_0.pdb
-rw-r--r--. 1 fbselee Domain Users 130248 Jul  8 12:59 dir1_noise0.8_20250705_114135_19.pdb
-rw-r--r--. 1 fbselee Domain Users 141484 Jul  8 12:59 dir1_noise0.8_20250705_114135_0.trb
-rw-r--r--. 1 fbselee Domain Users 143311 Jul  8 12:59 dir1_noise0.8_20250705_114135_18.trb
    """

    print("🧪 TESTING WITH SAMPLE DATA:")
    pdb_files, trb_files = parse_file_listing(sample_data)
    analysis = analyze_file_pairs(pdb_files, trb_files)
    print_analysis_report(analysis)

    print("\n" + "="*60)
    print("To use with your data:")
    print("1. Save your 'ls -la' output to a text file")
    print("2. In Colab, run main() and upload that file when prompted")
    print("3. Otherwise, call analyze_with_filename('your_file.txt')")

PDB/TRB File Matcher for Google Colab

📤 Please upload your file listing (ls -la output):


Saving final.txt to final.txt
✅ Successfully uploaded and read file: final.txt
📄 File size: 41160 characters
PDB/TRB FILE ANALYSIS REPORT

📊 SUMMARY:
  Total PDB files: 240
  Total TRB files: 240
  Matched pairs: 240
  Orphan PDB files: 0
  Orphan TRB files: 0

✅ MATCHED PAIRS (240):
   1. dir1_noise0-8_20250705_0
   2. dir1_noise0-8_20250705_1
   3. dir1_noise0-8_20250705_10
   4. dir1_noise0-8_20250705_11
   5. dir1_noise0-8_20250705_12
   6. dir1_noise0-8_20250705_13
   7. dir1_noise0-8_20250705_14
   8. dir1_noise0-8_20250705_15
   9. dir1_noise0-8_20250705_16
  10. dir1_noise0-8_20250705_17
     ... and 230 more

🔴 ORPHAN PDB FILES (0):
  No orphan PDB files!

🔴 ORPHAN TRB FILES (0):
  No orphan TRB files!

💾 FILES SAVED:
  matched_pairs.txt - 240 matched base names
  orphan_pdbs.txt - 0 orphan PDB files
  orphan_trbs.txt - 0 orphan TRB files

🎉 Analysis complete!
📥 You can download the generated files from the Colab file browser.
🧪 TESTING WITH SAMPLE 