# Consensus Docking Results Analysis

This notebook analyzes large-scale consensus docking results to evaluate binding pose consistency and perform cluster-based selectivity analysis.

## 🚀 Quick Start Guide

**For most users:** Simply run cells 2-4 to load, filter, and start analysis immediately.

**First-time users or data updates:** If you need to create/update the data files, run the optimized data preparation script:
```bash
python parse_prepare.py
```
The script uses multiprocessing across all CPU cores and typically processes millions of docking results in 5-10 minutes (about 25,000 records per second).

## 📊 Analysis Overview

### Main Analysis Workflow
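1. **Data Loading** (Step 1) - Smart loading of existing data files
2. **Score Column Clarification** (Step 1.5) - Add tool-specific score columns
3. **Cluster Integration** (Step 2) - Add cavity similarity information
4. **IC50 Data Integration** (Step 2.1) - Add experimental IC50/Ki measurements from TTD
5. **Sample Type Annotation** (Step 2.5) - Label positive and negative drug-target pairs
6. **Tool Coverage Filtering** (Step 2.6) - **NEW:** Filter for complete tool coverage
7. **Data Quality Check** (Step 3) - Dataset overview of filtered data
8. **Tool Reliability Analysis** (Step 4) - Consensus analysis between tools
9. **Cluster Analysis** - Binding site similarity and drug selectivity
10. **Visualizations** - Comprehensive plots and insights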

### Key Outputs
- **Fair tool comparisons** using only drug-target pairs with complete tool coverage
- Pose consistency metrics across docking tools
- Drug-target binding success rates
- Cluster-based selectivity patterns
- Tool agreement analysis

---

## 📥 Step 1: Smart Data Loading

This cell automatically detects and loads the best available data source. Run this first!
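
The fallback order used by the loading cell (Parquet first, then CSV) can be isolated in a small helper. The sketch below is only an illustration: `pick_data_file` and the file names are hypothetical and not part of the notebook, and the demo runs against empty files in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import Optional, Sequence


def pick_data_file(base_folder: str, candidates: Sequence[str]) -> Optional[Path]:
    """Return the first candidate file that exists under base_folder, else None."""
    base = Path(base_folder)
    for name in candidates:
        path = base / name
        if path.is_file():
            return path
    return None


# Demo in a throwaway directory (hypothetical file names, empty files)
with tempfile.TemporaryDirectory() as tmp:
    order = ["results.parquet", "results.csv"]
    none_found = pick_data_file(tmp, order)        # nothing exists yet
    (Path(tmp) / "results.csv").touch()
    csv_only = pick_data_file(tmp, order).name     # falls back to CSV
    (Path(tmp) / "results.parquet").touch()
    both = pick_data_file(tmp, order).name         # Parquet is preferred

print(none_found, csv_only, both)
```

Keeping the preference order in one list makes it easy to add another format later without touching the loading logic.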

In [16]:
# =============================================================================
# 📥 SMART DATA LOADING - START HERE
# =============================================================================

import os, re
import polars as pl
from pathlib import Path

# Configuration
PARQUET_FILE = "combined_consensus_docking_results.parquet"
CSV_FILE = "combined_consensus_docking_results.csv"
BASE_FOLDER = "/media/onur/Elements/cavity_space_consensus_docking/2025_06_29_batch_dock/"

print("🔍 Checking for existing data files...")

# Smart data loading: try Parquet first, then CSV; if neither exists, print instructions for creating them
combined_results = None

if os.path.exists(os.path.join(BASE_FOLDER, PARQUET_FILE)):
    print(f"✅ Found Parquet file: {PARQUET_FILE}")
    print("📖 Loading data (this is the fastest option)...")
    combined_results = pl.read_parquet(os.path.join(BASE_FOLDER, PARQUET_FILE))
    print(f"   Shape: {combined_results.shape}")
    print(f"   Memory: {combined_results.estimated_size() / (1024*1024):.1f} MB")
    print("✅ Data loaded successfully!")

elif os.path.exists(os.path.join(BASE_FOLDER, CSV_FILE)):
    print(f"✅ Found CSV file: {CSV_FILE}")
    print("📖 Loading data (slower than Parquet but still good)...")
    combined_results = pl.read_csv(os.path.join(BASE_FOLDER, CSV_FILE))
    print(f"   Shape: {combined_results.shape}")
    print(f"   Memory: {combined_results.estimated_size() / (1024*1024):.1f} MB")
    print("✅ Data loaded successfully!")

else:
    print("❌ No preprocessed data files found!")
    print(f"   Looking for: {PARQUET_FILE} or {CSV_FILE}")
    print("\nTo create the data files, run the optimized preparation script:")
    print("   python parse_prepare.py")
    print("\n⚡ This high-performance script features:")
    print("   • Multiprocessing across all CPU cores")
    print("   • Progress bars for visual feedback")
    print("   • Processing rate: ~25,000 records/second")
    print("   • Creates both CSV and Parquet formats")
    print("   • Typical runtime: 5-10 minutes for millions of records")
    combined_results = pl.DataFrame()  # Empty dataframe

# Quick validation
if not combined_results.is_empty():
    print(f"\n📊 Dataset Overview:")
    print(f"   Total rows: {combined_results.height:,}")
    print(f"   Columns: {combined_results.width}")
    print(f"   Key columns: {combined_results.columns}")
    
    # Check for required columns
    required_cols = ['drugbank_id', 'uniprot_id', 'RMSD', 'Score1', 'Score2']
    missing_cols = [col for col in required_cols if col not in combined_results.columns]
    if missing_cols:
        print(f"⚠️  Missing columns: {missing_cols}")
    else:
        print("✅ All required columns present")
else:
    print("\n⚠️  No data available for analysis")
    print("   Please run: python parse_prepare.py")

🔍 Checking for existing data files...
✅ Found Parquet file: combined_consensus_docking_results.parquet
📖 Loading data (this is the fastest option)...
   Shape: (28487256, 30)
   Memory: 9700.8 MB
✅ Data loaded successfully!

📊 Dataset Overview:
   Total rows: 28,487,256
   Columns: 30
   Key columns: ['Tool1', 'Tool2', 'PoseNumber1', 'PoseNumber2', 'Score1', 'Score2', 'File1', 'File2', 'RMSD', 'source_file', 'source_dir', 'file_size_mb', 'source_type', 'drugbank_id', 'uniprot_id', 'gene_name', 'cavity_index', 'Pose', 'SMINA_Score', 'Score', 'S(PLP)', 'S(hbond)', 'S(cho)', 'S(metal)', 'DE(clash)', 'DE(tors)', 'time', 'LeDock_Score', 'primary_tool', 'compound_target_pair']
✅ All required columns present

## 🏷️ Step 1.5: Clarify Score Column Names

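Add tool-specific score columns for easier downstream analysis. This cell derives `GOLD_Score`, `Smina_Score`, and `LeDock_Score` from the Tool1/Tool2 and Score1/Score2 values, making it clear which score belongs to which docking tool. The loaded data already contains a `LeDock_Score` column and a separate `SMINA_Score` column; the cell overwrites the former and leaves the latter untouched.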

In [17]:
# =============================================================================
# 🏷️ STEP 1.5: CLARIFY SCORE COLUMN NAMES
# =============================================================================

print("🏷️ CLARIFYING SCORE COLUMN NAMES")
print("=" * 50)

if not combined_results.is_empty():
    # Check if we have the required columns
    if all(col in combined_results.columns for col in ['Tool1', 'Tool2', 'Score1', 'Score2']):
        print("📊 Creating tool-specific score columns...")
        
        # Get unique tools to see what we're working with
        tools = set(combined_results['Tool1'].unique().to_list() + combined_results['Tool2'].unique().to_list())
        tools = [t for t in tools if t is not None]
        print(f"   Detected tools: {sorted(tools)}")
        
        # Create one score column per tool: use Score1 when the tool is Tool1,
        # Score2 when it is Tool2, and null otherwise (an existing 'LeDock_Score' is overwritten)
        combined_results = combined_results.with_columns([
            # For GOLD - check both Tool1 and Tool2
            pl.when(pl.col('Tool1') == 'GOLD')
              .then(pl.col('Score1'))
              .when(pl.col('Tool2') == 'GOLD')
              .then(pl.col('Score2'))
              .otherwise(None)
              .alias('GOLD_Score'),
            
            # For Smina - check both Tool1 and Tool2
            pl.when(pl.col('Tool1') == 'Smina')
              .then(pl.col('Score1'))
              .when(pl.col('Tool2') == 'Smina')
              .then(pl.col('Score2'))
              .otherwise(None)
              .alias('Smina_Score'),
            
            # For LeDock - check both Tool1 and Tool2
            pl.when(pl.col('Tool1') == 'LeDock')
              .then(pl.col('Score1'))
              .when(pl.col('Tool2') == 'LeDock')
              .then(pl.col('Score2'))
              .otherwise(None)
              .alias('LeDock_Score')
        ])
        
        # Verify the new columns
        print(f"\n✅ Created tool-specific score columns:")
        for tool in ['GOLD', 'Smina', 'LeDock']:
            col_name = f'{tool}_Score'
            if col_name in combined_results.columns:
                non_null_count = combined_results[col_name].drop_nulls().len()
                total_count = combined_results.height
                print(f"   {col_name}: {non_null_count:,}/{total_count:,} non-null values ({non_null_count/total_count*100:.1f}%)")
        
        # Show sample data
        print(f"\n📋 Sample data with tool-specific scores:")
        sample_cols = ['Tool1', 'Score1', 'Tool2', 'Score2', 'GOLD_Score', 'Smina_Score', 'LeDock_Score']
        available_cols = [col for col in sample_cols if col in combined_results.columns]
        sample = combined_results.select(available_cols).head(5)
        print(sample)
        
        print(f"\n🎯 Score columns clarified successfully!")
        
    else:
        missing_cols = [col for col in ['Tool1', 'Tool2', 'Score1', 'Score2'] if col not in combined_results.columns]
        print(f"⚠️  Missing required columns: {missing_cols}")
        print("   Skipping score column clarification...")
else:
    print("❌ No data available for score column clarification")
    print("   Please load data first (Step 1)")

🏷️ CLARIFYING SCORE COLUMN NAMES
📊 Creating tool-specific score columns...
   Detected tools: ['GOLD', 'LeDock', 'Smina']

✅ Created tool-specific score columns:
   GOLD_Score: 17,050,379/28,487,256 non-null values (59.9%)
   Smina_Score: 20,997,094/28,487,256 non-null values (73.7%)
   LeDock_Score: 17,785,175/28,487,256 non-null values (62.4%)

📋 Sample data with tool-specific scores:
shape: (5, 7)
┌────────┬────────┬───────┬────────┬────────────┬─────────────┬──────────────┐
│ Tool1  ┆ Score1 ┆ Tool2 ┆ Score2 ┆ GOLD_Score ┆ Smina_Score ┆ LeDock_Score │
│ ---    ┆ ---    ┆ ---   ┆ ---    ┆ ---        ┆ ---         ┆ ---          │
│ str    ┆ f64    ┆ str   ┆ f64    ┆ f64        ┆ f64    

## 🧬 Step 2: Cluster Integration

Add cavity cluster information for advanced analysis (run once per session).
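
The core of this step is turning each cluster row (a cluster `id` plus a comma-separated `items` string of cavity IDs in the form `AF-{UniProtID}-F{Fragment}-model_v1_C{CavityIndex}`) into a `(uniprot_id, cavity_index) -> cluster_id` lookup. The sketch below shows that parsing on invented rows; `build_cavity_map` is a hypothetical helper, and it uses `fullmatch`, which is stricter than the `re.match` used in the cell.

```python
import re
from typing import Dict, Iterable, Tuple

CAVITY_ID = re.compile(r"AF-([A-Z0-9]+)-F\d+-model_v1_C(\d+)")


def build_cavity_map(rows: Iterable[dict]) -> Dict[Tuple[str, int], int]:
    """Map (uniprot_id, cavity_index) to the cluster id that lists that cavity."""
    mapping: Dict[Tuple[str, int], int] = {}
    for row in rows:
        items = row.get("items") or ""          # tolerate missing/empty item lists
        for cavity_id in items.split(","):
            match = CAVITY_ID.fullmatch(cavity_id.strip())
            if match:
                mapping[(match.group(1), int(match.group(2)))] = row["id"]
    return mapping


# Invented cluster rows, for illustration only
rows = [
    {"id": 1, "items": "AF-P47901-F1-model_v1_C1, AF-P08173-F1-model_v1_C3"},
    {"id": 2, "items": "AF-P08173-F1-model_v1_C1"},
    {"id": 3, "items": None},
]
cavity_map = build_cavity_map(rows)
print(cavity_map)
```

Because the fragment number is matched but not stored in the key, cavities from different fragments of the same protein that share an index would collide, and the last one parsed wins.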

In [19]:
# =============================================================================
# 🗃️ STEP 2: CAVITY CLUSTER INTEGRATION
# =============================================================================

print("🗃️ CAVITY CLUSTER INTEGRATION")
print("=" * 50)

if not combined_results.is_empty():
    # Check if we already have cluster information
    if 'cavity_cluster_id' in combined_results.columns:
        non_null_clusters = combined_results['cavity_cluster_id'].drop_nulls().len()
        if non_null_clusters > 0:
            print(f"✅ Cluster data already present: {non_null_clusters:,} mapped entries")
            print("   Skipping cluster integration...")
        else:
            print("⚠️  Cluster column exists but empty - proceeding with integration...")
    
    # Proceed with cluster integration if needed
    if 'cavity_cluster_id' not in combined_results.columns or combined_results['cavity_cluster_id'].drop_nulls().len() == 0:
        try:
            print(f"📖 Loading cavity cluster data...")
            cluster_file = "/opt/data/cavity_space/cavity_cluster_similarity07.csv"
            clusters_df = pl.read_csv(cluster_file, separator='\t')
            print(f"📖 Loaded {clusters_df.height:,} clusters from CavitySpace")
            
            # Extract drugbank_id, gene_name, uniprot_id and cavity_index from the last path component of source_dir
            print(f"Extracting cavity identifiers from source paths...")
            
            combined_results = combined_results.with_columns([
                # Extract drugbank_id (1st component before first underscore)
                pl.col('source_dir').str.split('/').list.last()
                .str.extract(r'(DB\d+)_', group_index=1)
                .alias('extracted_drugbank_id'),
                
                # Extract gene_name (2nd component - can be 'nan' for negative samples)
                # Pattern: DB00035_AVPR1B_P47901_cavity_1 (positive) or DB00035_nan_P08173_cavity_3 (negative)
                pl.col('source_dir').str.split('/').list.last()
                .str.extract(r'DB\d+_([A-Z0-9]+|nan)_', group_index=1)
                .alias('extracted_gene_name'),
                
                # Extract uniprot_id (component before 'cavity_')
                # More flexible pattern to handle both positive and negative samples
                pl.col('source_dir').str.split('/').list.last()
                .str.extract(r'_([A-Z0-9]+)_cavity_\d+', group_index=1)
                .alias('extracted_uniprot_id'),
                
                # Extract cavity_index (number after 'cavity_')
                pl.col('source_dir').str.split('/').list.last()
                .str.extract(r'cavity_(\d+)', group_index=1)
                .cast(pl.Int64, strict=False)
                .alias('extracted_cavity_index')
            ])
            
            # Check extraction results
            non_null_uniprot = combined_results['extracted_uniprot_id'].drop_nulls().len()
            non_null_cavity = combined_results['extracted_cavity_index'].drop_nulls().len()
            
            print(f"✅ Extraction results:")
            print(f"   Extracted uniprot_id: {non_null_uniprot:,} non-null values")
            print(f"   Extracted cavity_index: {non_null_cavity:,} non-null values")
            
            # Show sample extracted data for debugging
            sample_data = combined_results.select(['source_dir', 'extracted_uniprot_id', 'extracted_cavity_index']).head(3)
            print(f"   Sample extracted data:")
            print(sample_data)
            
            if non_null_uniprot > 0 and non_null_cavity > 0:
                # Create mapping dictionary from the cluster file
                cavity_to_cluster = {}
                successful_parses = 0
                failed_parses = 0
                
                print(f"🔄 Processing cluster file to create mapping...")
                
                for i, row in enumerate(clusters_df.to_dicts()):
                    cluster_id = row['id']  # The cluster ID
                    cavity_items = row['items']  # Comma-separated cavity IDs
                    
                    # Split the cavity items and process each one
                    if cavity_items and isinstance(cavity_items, str):
                        cavity_ids = cavity_items.split(',')
                        
                        for cavity_id in cavity_ids:
                            cavity_id = cavity_id.strip()
                            
                            # Parse cavity format: AF-{UniProtID}-F{Fragment}-model_v1_C{CavityIndex}
                            match = re.match(r'AF-([A-Z0-9]+)-F\d+-model_v1_C(\d+)', cavity_id)
                            if match:
                                uniprot_id, cavity_index = match.groups()
                                key = (uniprot_id, int(cavity_index))
                                cavity_to_cluster[key] = cluster_id
                                successful_parses += 1
                            else:
                                failed_parses += 1
                                if failed_parses <= 5:  # Show first few failures
                                    print(f"   ⚠️ Failed to parse cavity ID: '{cavity_id}'")
                
                print(f"📊 Cluster parsing results:")
                print(f"   Successfully parsed: {successful_parses:,} cavity IDs")
                print(f"   Failed to parse: {failed_parses:,} cavity IDs")
                print(f"   Created mapping for {len(cavity_to_cluster):,} unique cavities")
                
                # Show sample mapping entries
                sample_keys = list(cavity_to_cluster.keys())[:5]
                print(f"   Sample mappings:")
                for key in sample_keys:
                    print(f"     {key} -> cluster {cavity_to_cluster[key]}")
                
                # Check what UniProt IDs we have in our data vs cluster file
                our_uniprots = set(combined_results.filter(pl.col('extracted_uniprot_id').is_not_null())['extracted_uniprot_id'].unique().to_list())
                cluster_uniprots = set(key[0] for key in cavity_to_cluster.keys())
                
                print(f"\n🔍 UniProt ID overlap analysis:")
                print(f"   UniProts in our data: {len(our_uniprots):,}")
                print(f"   UniProts in cluster file: {len(cluster_uniprots):,}")
                print(f"   Overlap: {len(our_uniprots & cluster_uniprots):,}")
                
                # Show sample UniProts from each set
                print(f"   Sample from our data: {sorted(list(our_uniprots))[:5]}")
                print(f"   Sample from clusters: {sorted(list(cluster_uniprots))[:5]}")
                
                # Map clusters to our data
                def map_cluster(uniprot_id, cavity_index):
                    if cavity_index is None or uniprot_id is None:
                        return None
                    key = (uniprot_id, cavity_index)
                    return cavity_to_cluster.get(key)
                
                print(f"\n🔄 Applying cluster mapping...")
                
                combined_results = combined_results.with_columns([
                    pl.struct(['extracted_uniprot_id', 'extracted_cavity_index'])
                    .map_elements(lambda x: map_cluster(x['extracted_uniprot_id'], x['extracted_cavity_index']), return_dtype=pl.Int64)
                    .alias('cavity_cluster_id')
                ])
                
                # Report mapping results
                mapped_count = combined_results['cavity_cluster_id'].drop_nulls().len()
                total_count = len(combined_results)
                # Drop nulls first so unmapped rows are not counted as an extra cluster
                unique_clusters = combined_results['cavity_cluster_id'].drop_nulls().n_unique()
                
                print(f"✅ Cluster mapping complete:")
                print(f"   Mapped: {mapped_count:,}/{total_count:,} ({mapped_count/total_count*100:.1f}%)")
                print(f"   Unique clusters: {unique_clusters:,}")
                
                if mapped_count > 0:
                    # Show sample mapped data
                    sample_mapped = combined_results.filter(pl.col('cavity_cluster_id').is_not_null()).select(['extracted_uniprot_id', 'extracted_cavity_index', 'cavity_cluster_id']).head(3)
                    print(f"   Sample mapped data:")
                    print(sample_mapped)
                    
                # Debug unmapped entries
                if mapped_count < total_count:
                    print(f"\n🔍 Debugging unmapped entries:")
                    unmapped = combined_results.filter(pl.col('cavity_cluster_id').is_null())
                    sample_unmapped = unmapped.select(['extracted_uniprot_id', 'extracted_cavity_index']).head(5)
                    print(f"   Sample unmapped entries:")
                    print(sample_unmapped)
                    
                    # Check if these should have mappings
                    for row in sample_unmapped.to_dicts():
                        uniprot_id = row['extracted_uniprot_id']
                        cavity_index = row['extracted_cavity_index']
                        key = (uniprot_id, cavity_index)
                        if key in cavity_to_cluster:
                            print(f"   ❗ Key {key} should map to cluster {cavity_to_cluster[key]} but doesn't!")
                        else:
                            print(f"   ✓ Key {key} correctly not in cluster mapping")
                
            else:
                print("❌ Extraction failed - adding empty cluster column...")
                combined_results = combined_results.with_columns([
                    pl.lit(None, dtype=pl.Int64).alias('cavity_cluster_id')
                ])
            
        except FileNotFoundError:
            print(f"⚠️  Cluster file not found: {cluster_file}")
            print("   Adding empty cluster column...")
            combined_results = combined_results.with_columns([
                pl.lit(None, dtype=pl.Int64).alias('cavity_cluster_id')
            ])
        except Exception as e:
            print(f"❌ Error loading clusters: {e}")
            print(f"   Error details: {type(e).__name__}: {str(e)}")
            combined_results = combined_results.with_columns([
                pl.lit(None, dtype=pl.Int64).alias('cavity_cluster_id')
            ])
    
    print(f"🎯 Ready for analysis with shape: {combined_results.shape}")
    
else:
    print("❌ No data available for cluster integration")
    print("   Please load data first (Step 1)")

🗃️ CAVITY CLUSTER INTEGRATION
📖 Loading cavity cluster data...
📖 Loaded 12,943 clusters from CavitySpace
Extracting cavity identifiers from source paths...
✅ Extraction results:
   Extracted uniprot_id: 6,735,070 non-null values
   Extracted cavity_index: 6,935,086 non-null values
   Sample extracted data:
shape: (3, 3)
┌─────────────────────────────────┬──────────────────────┬────────────────────────┐
│ source_dir                      ┆ extracted_uniprot_id ┆ extracted_cavity_index │
│ ---                             ┆ ---                  ┆ ---                    │
│ str                             ┆ str                  ┆ i64                    │
╞═════════════════════════════════╪══════════

## 🧪 Step 2.1: IC50 Data Integration

Add experimental IC50/Ki measurements for drug-target interactions from the Therapeutic Target Database (TTD). This provides quantitative binding affinity data that can be used to:
- Validate docking predictions against experimental measurements
- Compare predicted binding scores with actual IC50 values
- Enrich the dataset with pharmacological activity information

The IC50 data was generated using the `create_ic50_mapping.py` script which integrates:
- TTD target information mapped to UniProt IDs
- TTD activity measurements mapped to DrugBank IDs via PubChem Compound IDs
- IC50 and Ki values (all converted to nM)
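
Since the values are stored in nM and span several orders of magnitude, comparisons against docking scores are usually easier on a logarithmic scale. A minimal sketch of the standard conversion, pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM), follows; `pic50_from_nm` is a hypothetical helper, and the same formula applied to Ki gives pKi.

```python
import math
from typing import Optional


def pic50_from_nm(value_nm: Optional[float]) -> Optional[float]:
    """Convert a concentration in nM to its negative log10 molar value."""
    if value_nm is None or value_nm <= 0:
        return None                     # missing or non-physical measurement
    return 9.0 - math.log10(value_nm)


# Higher pIC50 means a more potent compound
examples = {nm: pic50_from_nm(nm) for nm in (1.0, 100.0, 10_000.0)}
print(examples)
```

Inside the notebook the same transformation could be written as a Polars expression such as `9 - pl.col('ic50_value').log10()`, assuming `ic50_value` is strictly positive.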

In [None]:
# =============================================================================
# 🧪 STEP 2.1: IC50 DATA INTEGRATION
# =============================================================================

print("🧪 IC50 DATA INTEGRATION")
print("=" * 50)

if not combined_results.is_empty():
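    # Check if we already have IC50 information
    ic50_cols = ['ic50_value', 'ic50_unit', 'measurement_type', 'operator', 'pubchem_cid', 'activity', 'n_measurements']
    needs_ic50 = True
    if 'ic50_value' in combined_results.columns:
        non_null_ic50 = combined_results['ic50_value'].drop_nulls().len()
        if non_null_ic50 > 0:
            print(f"✅ IC50 data already present: {non_null_ic50:,} mapped entries")
            print("   Skipping IC50 integration...")
            needs_ic50 = False
        else:
            print("⚠️  IC50 column exists but empty - proceeding with integration...")
            # Drop the empty placeholder columns so the join below does not create '*_right' duplicates
            combined_results = combined_results.drop([c for c in ic50_cols if c in combined_results.columns])

    if needs_ic50:
        print("📂 Loading IC50 mapping data...")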
        
        # Define the path to IC50 mapping file
        ic50_file = "/media/onur/Elements/cavity_space_consensus_docking/2025_06_29_batch_dock/ic50_mapping.csv"
        
        try:
            # Load IC50 mapping data
            ic50_df = pl.read_csv(ic50_file)
            
            print(f"   Loaded IC50 data: {ic50_df.shape}")
            print(f"   Columns: {ic50_df.columns}")
            print(f"\n📊 IC50 Data Summary:")
            print(f"   Unique drugs: {ic50_df['drugbank_id'].n_unique():,}")
            print(f"   Unique targets: {ic50_df['uniprot_id'].n_unique():,}")
            print(f"   Total measurements: {len(ic50_df):,}")
            
            # Display measurement type distribution
            measurement_counts = ic50_df.group_by('measurement_type').len().sort('len', descending=True)
            print(f"\n   Measurement types:")
            for row in measurement_counts.iter_rows(named=True):
                print(f"     {row['measurement_type']}: {row['len']:,}")
            
            # Display operator distribution
            operator_counts = ic50_df.group_by('operator').len().sort('len', descending=True)
            print(f"\n   Operators:")
            for row in operator_counts.iter_rows(named=True):
                print(f"     {row['operator']}: {row['len']:,}")
            
            # Check for duplicates in IC50 data
            ic50_unique_pairs = ic50_df.select(['drugbank_id', 'uniprot_id']).unique().height
            ic50_total = ic50_df.height
            print(f"\n⚠️  IC50 Data Duplication Check:")
            print(f"   Total IC50 records: {ic50_total:,}")
            print(f"   Unique drug-target pairs: {ic50_unique_pairs:,}")
            if ic50_total > ic50_unique_pairs:
                print(f"   ⚠️  Found {ic50_total - ic50_unique_pairs:,} duplicate pairs (multiple measurements)")
                print(f"   Strategy: Will aggregate to keep best (lowest) IC50 value per pair")
            
            # Aggregate IC50 data to handle duplicates - keep the lowest value per drug-target pair
            # (a lower IC50/Ki means higher potency). Sorting first, with nulls last, makes .first()
            # return the unit, type and operator of that same lowest-value record.
            ic50_aggregated = ic50_df.sort('ic50_value', nulls_last=True).group_by(
                ['drugbank_id', 'uniprot_id'], maintain_order=True
            ).agg([
                pl.col('ic50_value').min().alias('ic50_value'),  # Lowest (best) IC50
                pl.col('ic50_unit').first().alias('ic50_unit'),
                pl.col('measurement_type').first().alias('measurement_type'),
                pl.col('operator').first().alias('operator'),
                pl.col('pubchem_cid').first().alias('pubchem_cid'),
                pl.col('activity').first().alias('activity'),
                pl.len().alias('n_measurements')  # Track how many measurements were aggregated
            ])
            
            print(f"   After aggregation: {ic50_aggregated.height:,} unique drug-target pairs")
            
            # Merge IC50 data with combined results
            # Match on both drugbank_id and uniprot_id
            print(f"\n🔗 Merging IC50 data with docking results...")
            before_merge = combined_results.shape
            
            combined_results = combined_results.join(
                ic50_aggregated,
                left_on=['drugbank_id', 'uniprot_id'],
                right_on=['drugbank_id', 'uniprot_id'],
                how='left'
            )
            
            after_merge = combined_results.shape
            matched_ic50 = combined_results['ic50_value'].drop_nulls().len()
            
            print(f"   Before merge: {before_merge}")
            print(f"   After merge:  {after_merge}")
            print(f"   ✅ Matched {matched_ic50:,} docking results with IC50 data")
            print(f"   Coverage: {matched_ic50/len(combined_results)*100:.2f}% of docking results")
            
            # Show some examples of matched data
            if matched_ic50 > 0:
                print(f"\n📋 Sample IC50-matched entries:")
                # Use available columns - check which ones exist
                display_cols = ['drugbank_id', 'uniprot_id', 'ic50_value', 'ic50_unit', 'measurement_type', 'operator', 'n_measurements']
                
                # Add tool columns if they exist
                if 'Tool1' in combined_results.columns:
                    display_cols.insert(2, 'Tool1')
                if 'Tool2' in combined_results.columns:
                    display_cols.insert(3, 'Tool2')
                
                # Add score columns if they exist
                if 'Score1' in combined_results.columns:
                    display_cols.insert(-4, 'Score1')
                if 'Score2' in combined_results.columns:
                    display_cols.insert(-4, 'Score2')
                
                sample = combined_results.filter(
                    pl.col('ic50_value').is_not_null()
                ).select(display_cols).head(5)
                print(sample)
                
                # Show IC50 value statistics for matched entries
                ic50_stats = combined_results.filter(
                    pl.col('ic50_value').is_not_null()
                ).select(pl.col('ic50_value')).describe()
                print(f"\n📈 IC50 Value Statistics (nM):")
                print(ic50_stats)
            else:
                print(f"\n⚠️  No IC50 values matched with docking results")
                print(f"   This may indicate that the docked drug-target pairs")
                print(f"   don't have experimental measurements in the TTD database")
            
        except FileNotFoundError:
            print(f"❌ IC50 file not found: {ic50_file}")
            print("   Skipping IC50 integration...")
            # Add empty columns to maintain schema consistency
            combined_results = combined_results.with_columns([
                pl.lit(None, dtype=pl.Float64).alias('ic50_value'),
                pl.lit(None, dtype=pl.Utf8).alias('ic50_unit'),
                pl.lit(None, dtype=pl.Utf8).alias('measurement_type'),
                pl.lit(None, dtype=pl.Utf8).alias('operator'),
                pl.lit(None, dtype=pl.Utf8).alias('pubchem_cid'),
                pl.lit(None, dtype=pl.Utf8).alias('activity'),
                pl.lit(None, dtype=pl.Int64).alias('n_measurements')
            ])
        except Exception as e:
            print(f"❌ Error loading IC50 data: {e}")
            print("   Skipping IC50 integration...")
            # Add empty columns to maintain schema consistency (same set as the FileNotFoundError branch)
            combined_results = combined_results.with_columns([
                pl.lit(None, dtype=pl.Float64).alias('ic50_value'),
                pl.lit(None, dtype=pl.Utf8).alias('ic50_unit'),
                pl.lit(None, dtype=pl.Utf8).alias('measurement_type'),
                pl.lit(None, dtype=pl.Utf8).alias('operator'),
                pl.lit(None, dtype=pl.Utf8).alias('pubchem_cid'),
                pl.lit(None, dtype=pl.Utf8).alias('activity'),
                pl.lit(None, dtype=pl.Int64).alias('n_measurements')
            ])
    
    print(f"🎯 Ready for analysis with shape: {combined_results.shape}")
    
else:
    print("❌ No data available for IC50 integration")
    print("   Please load data first (Step 1)")

## 🏷️ Step 2.5: Sample Type Annotation (Positive vs Negative)

**Critical Context:** The docking results include both:
- **Positive samples**: Known drug-target interactions from validated databases
- **Negative samples**: Randomly generated drug-target pairs (controls) to test specificity

This annotation is essential for proper evaluation:
- It enables us to assess how well docking tools distinguish true interactions from random pairings
- All subsequent analyses must account for sample type to avoid conflating signal with noise
- Performance metrics (ROC, precision-recall) require this ground truth labeling

We'll load sample type information from `required_structures_with_negatives.csv` and merge it into our dataset based on UniProt ID, DrugBank ID, and cavity index.

**⚠️ Important:** Run this step before Step 2.6 (filtering); the balanced sample filtering there depends on the sample type labels.
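
To make concrete why the labels matter, the sketch below computes a ROC AUC from scores and positive/negative labels as the probability that a randomly chosen positive outranks a randomly chosen negative. It is an illustration on invented numbers, not the notebook's evaluation code; `roc_auc` is a hypothetical helper that assumes higher scores are better, so scores where lower means better binding would need to be negated first.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """AUC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented scores: three positive and three negative drug-target pairs
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.4]
labels = [True, True, True, False, False, False]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of samples, so for the full dataset a rank-based formulation or a library implementation would be the practical choice.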

In [None]:
# =============================================================================
# 🏷️ SAMPLE TYPE ANNOTATION (POSITIVE VS NEGATIVE)
# =============================================================================

if not combined_results.is_empty():
    print("🏷️ Starting sample type annotation...")
    
    # Load the sample type metadata
    sample_metadata_file = "/media/onur/Elements/cavity_space_consensus_docking/2025_06_29_batch_dock/required_structures_with_negatives.csv"
    
    try:
        print(f"📖 Loading sample type metadata from:\n   {sample_metadata_file}")
        
        # Load the metadata file
        sample_metadata = pl.read_csv(sample_metadata_file)
        
        print(f"‚úÖ Loaded {sample_metadata.height:,} rows from metadata file")
        print(f"   Columns: {sample_metadata.columns}")
        
        # Show sample of metadata
        print(f"\nüìã Sample metadata (positive samples):")
        print(sample_metadata.filter(pl.col('sample_type') == 'positive').head(3))
        print(f"\nüìã Sample metadata (negative samples):")
        print(sample_metadata.filter(pl.col('sample_type').str.contains('negative')).head(3))
        
        # Check sample type distribution in metadata
        sample_type_counts = sample_metadata.group_by('sample_type').agg(pl.len()).sort('sample_type')
        print(f"\nüìä Sample type distribution in metadata:")
        print(sample_type_counts)
        
        # Prepare metadata for merging - select relevant columns
        # Note: The metadata uses 'UniProt_ID' and 'Cavity_Index', while combined_results uses 'uniprot_id' and 'extracted_cavity_index'
        merge_metadata = sample_metadata.select([
            pl.col('UniProt_ID').alias('uniprot_id'),
            'drugbank_id',
            pl.col('Cavity_Index').alias('cavity_index'),
            'sample_type',
            'Gene_Name'  # Keep this for additional context
        ])
        
        print(f"\n🔄 Merging sample type information...")
        print(f"   Merge keys: uniprot_id, drugbank_id, cavity_index")
        
        # Check if we have the required columns in combined_results
        if 'extracted_cavity_index' in combined_results.columns:
            # Use extracted_cavity_index for merging
            combined_results = combined_results.with_columns([
                pl.col('extracted_cavity_index').alias('cavity_index')
            ])
        elif 'cavity_index' not in combined_results.columns:
            print("⚠️  Warning: No cavity_index column found in combined_results!")
            print("   This may affect merge accuracy.")
        
        # Before merge - check data availability
        pre_merge_rows = combined_results.height
        unique_pairs_in_data = combined_results.select(['uniprot_id', 'drugbank_id', 'cavity_index']).unique().height
        unique_pairs_in_metadata = merge_metadata.select(['uniprot_id', 'drugbank_id', 'cavity_index']).unique().height
        
        print(f"\n📊 Pre-merge statistics:")
        print(f"   Combined_results rows: {pre_merge_rows:,}")
        print(f"   Unique (uniprot, drug, cavity) in data: {unique_pairs_in_data:,}")
        print(f"   Unique (uniprot, drug, cavity) in metadata: {unique_pairs_in_metadata:,}")
        
        # Perform left join to add sample_type to combined_results
        combined_results = combined_results.join(
            merge_metadata,
            on=['uniprot_id', 'drugbank_id', 'cavity_index'],
            how='left'
        )
        
        # Check merge results
        post_merge_rows = combined_results.height
        annotated_rows = combined_results.filter(pl.col('sample_type').is_not_null()).height
        
        print(f"\n✅ Merge completed:")
        print(f"   Rows after merge: {post_merge_rows:,}")
        print(f"   Rows with sample_type: {annotated_rows:,} ({annotated_rows/post_merge_rows*100:.1f}%)")
        
        # A left join adds rows only when the metadata has duplicate merge keys
        if post_merge_rows != pre_merge_rows:
            print(f"   ⚠️  Row count changed during merge ({pre_merge_rows:,} -> {post_merge_rows:,}); check metadata for duplicate keys")
        
        if annotated_rows < post_merge_rows:
            unannotated_rows = post_merge_rows - annotated_rows
            print(f"   ⚠️  Unannotated rows: {unannotated_rows:,} ({unannotated_rows/post_merge_rows*100:.1f}%)")
            
            # Show sample of unannotated data for debugging
            print(f"\n📋 Sample unannotated data (first 3 rows):")
            sample_unmapped = combined_results.filter(pl.col('sample_type').is_null()).select([
                'uniprot_id', 'drugbank_id', 'cavity_index', 'source_dir'
            ]).head(3)
            print(sample_unmapped)
        
        # Show sample type distribution in annotated data
        if annotated_rows > 0:
            annotated_type_counts = combined_results.filter(
                pl.col('sample_type').is_not_null()
            ).group_by('sample_type').agg(pl.len()).sort('sample_type')
            
            print(f"\n📊 Sample type distribution in annotated docking results:")
            print(annotated_type_counts)
            
            # Show sample of annotated data
            print(f"\n📋 Sample annotated data (positive):")
            sample_mapped = combined_results.filter(
                pl.col('sample_type') == 'positive'
            ).select(['uniprot_id', 'drugbank_id', 'cavity_index', 'sample_type', 'Gene_Name']).head(3)
            print(sample_mapped)
            
            print(f"\n📋 Sample annotated data (negative):")
            sample_mapped = combined_results.filter(
                pl.col('sample_type').str.contains('negative')
            ).select(['uniprot_id', 'drugbank_id', 'cavity_index', 'sample_type', 'Gene_Name']).head(3)
            print(sample_mapped)
        
        print(f"\n✅ Sample type annotation complete!")
        print(f"   Dataset now includes 'sample_type' column for positive/negative discrimination")
        
    except FileNotFoundError:
        print(f"❌ Error: Could not find sample metadata file:")
        print(f"   {sample_metadata_file}")
        print(f"   Sample type annotation skipped.")
    except Exception as e:
        print(f"❌ Error during sample type annotation: {e}")
        print(f"   Sample type annotation skipped.")
        
else:
    print("❌ No data available for sample type annotation")
    print("   Please load data first (Step 1 & 2)")

## 🔍 Step 2.6: Filter for Complete Tool Coverage and Balanced Samples

**Important Filtering Criteria:**

1. **Complete Tool Coverage**: Only include drug-target pairs where **ALL THREE tools** (GOLD, Smina, LeDock) made predictions. If only two of these tools appear in the dataset, the code requires both of them instead. This eliminates bias from partial tool coverage.

2. **Balanced Sample Types**: For each drug, ensure equal representation of positive and negative samples:
   - Filter out drugs that have only positive OR only negative samples
   - Keep only drugs with both sample types present
   - Balance the counts by randomly downsampling the larger class (fixed seed 42) so that each drug has equal numbers of positive and negative rows

**Prerequisites:** 
- Step 2.5 (Sample Type Annotation) must be completed first
- The `sample_type` column must be present in the dataset

These filtering steps ensure fair comparison between tools and valid positive vs negative sample discrimination analyses.

In [None]:
# =============================================================================
# 🔍 FILTER FOR COMPLETE TOOL COVERAGE AND BALANCED SAMPLES
# =============================================================================

if not combined_results.is_empty():
    print("🔍 Starting filtering process...")
    print("   Criteria: (1) Complete tool coverage, (2) Balanced positive/negative samples")
    
    # STEP 1: Filter for final_results source_type only
    print("\n📋 Step 1: Filtering for final_results source_type...")
    original_rows = combined_results.height
    
    if 'source_type' in combined_results.columns:
        # Check what source_type values we have
        source_types = combined_results['source_type'].unique().to_list()
        print(f"   Available source_types: {source_types}")
        
        # Filter for final_results only
        combined_results = combined_results.filter(pl.col('source_type') == 'final_results')
        filtered_rows = combined_results.height
        
        print(f"   Original rows: {original_rows:,}")
        print(f"   After final_results filter: {filtered_rows:,} ({filtered_rows/original_rows*100:.1f}%)")
    else:
        print("   ⚠️ No source_type column found, skipping source_type filtering")
        filtered_rows = original_rows  # Keep the later summary statistics defined
    
    # STEP 2: Analyze tool coverage
    print("\n📋 Step 2: Analyzing tool coverage...")
    
    if 'Tool1' in combined_results.columns and 'Tool2' in combined_results.columns:
        # Filter out null values from tool lists
        tool1_list = combined_results.filter(pl.col('Tool1').is_not_null())['Tool1'].unique().to_list()
        tool2_list = combined_results.filter(pl.col('Tool2').is_not_null())['Tool2'].unique().to_list()
        
        # Combine and sort, excluding any None values
        all_tools = tool1_list + tool2_list
        all_detected_tools = sorted([tool for tool in set(all_tools) if tool is not None])
        
        print(f"   All detected tools: {all_detected_tools}")
        
        # Define the three main tools we expect
        expected_tools = ['GOLD', 'Smina', 'LeDock']
        available_expected_tools = [tool for tool in expected_tools if tool in all_detected_tools]
        
        print(f"   Expected tools found: {available_expected_tools}")
        
        if len(available_expected_tools) >= 2:  # Need at least 2 tools for comparison
            print(f"\n📊 Step 3: Checking tool coverage per drug-target pair...")
            
            # Group by drug-target pairs and check tool coverage
            drug_target_groups = combined_results.group_by(['drugbank_id', 'uniprot_id'])
            
            complete_coverage_pairs = []
            coverage_summary = []
            
            for group_key, group_data in drug_target_groups:
                drug = group_key[0]
                target = group_key[1]
                
                # Get unique tools that made predictions for this drug-target pair
                tools_t1 = group_data.filter(pl.col('Tool1').is_not_null())['Tool1'].unique().to_list()
                tools_t2 = group_data.filter(pl.col('Tool2').is_not_null())['Tool2'].unique().to_list()
                tools_in_group = set(tools_t1 + tools_t2)
                tools_present = [tool for tool in available_expected_tools if tool in tools_in_group]
                
                coverage_summary.append({
                    'drug': drug,
                    'target': target,
                    'tools_present': tools_present,
                    'n_tools': len(tools_present),
                    'complete_coverage': len(tools_present) == len(available_expected_tools),
                    'original_rows': group_data.height
                })
                
                # If all expected tools are present, keep this drug-target pair
                if len(tools_present) == len(available_expected_tools):
                    complete_coverage_pairs.append((drug, target))
            
            # STEP 3: Filter for complete coverage pairs
            if complete_coverage_pairs:
                print(f"   Found {len(complete_coverage_pairs):,} pairs with complete tool coverage")
                
                # Build one vectorized filter on a composite "drug|target" key
                # (chaining one OR expression per pair scales poorly for many pairs;
                # assumes IDs are strings that never contain '|')
                complete_keys = [f"{drug}|{target}" for drug, target in complete_coverage_pairs]
                complete_filter = pl.concat_str(
                    ['drugbank_id', 'uniprot_id'], separator='|'
                ).is_in(complete_keys)
                
                # Apply the filter
                combined_results = combined_results.filter(complete_filter)
                final_rows = combined_results.height
                
                # Report filtering results
                original_pairs = len(coverage_summary)
                complete_pairs = len(complete_coverage_pairs)
                
                print(f"\n📈 FILTERING RESULTS:")
                print("=" * 40)
                print(f"📊 Original drug-target pairs: {original_pairs:,}")
                print(f"✅ Complete coverage pairs: {complete_pairs:,} ({complete_pairs/original_pairs*100:.1f}%)")
                print(f"📋 After source_type filter: {filtered_rows:,}")
                print(f"🔄 Final filtered data: {final_rows:,} ({final_rows/filtered_rows*100:.1f}%)")
                
                # Tool coverage distribution
                coverage_dist = {}
                for item in coverage_summary:
                    n_tools = item['n_tools']
                    coverage_dist[n_tools] = coverage_dist.get(n_tools, 0) + 1
                
                print(f"\n🎯 TOOL COVERAGE DISTRIBUTION:")
                for n_tools in sorted(coverage_dist.keys(), reverse=True):
                    count = coverage_dist[n_tools]
                    pct = count / original_pairs * 100
                    print(f"   {n_tools} tools: {count:,} pairs ({pct:.1f}%)")
                
                print(f"\n✅ Dataset filtered for fair tool comparison")
                print(f"   Only using drug-target pairs where ALL {len(available_expected_tools)} expected tools made predictions")
                
                # STEP 4: Filter for balanced positive/negative samples per drug
                print(f"\n📋 Step 4: Filtering for balanced positive/negative samples...")
                
                if 'sample_type' in combined_results.columns:
                    # Check if we have sample type information
                    sample_type_counts = combined_results.filter(
                        pl.col('sample_type').is_not_null()
                    ).group_by('sample_type').agg(pl.len()).sort('sample_type')
                    
                    print(f"   Sample type distribution before balancing:")
                    for row in sample_type_counts.iter_rows(named=True):
                        print(f"      {row['sample_type']}: {row['len']:,}")
                    
                    # Identify positive and negative sample types
                    has_positive = combined_results.filter(
                        pl.col('sample_type') == 'positive'
                    ).height > 0
                    
                    has_negative = combined_results.filter(
                        pl.col('sample_type').str.contains('negative')
                    ).height > 0
                    
                    if has_positive and has_negative:
                        print(f"   ✅ Both positive and negative samples found")
                        
                        # Group by drug and sample type, counting rows per group
                        # (n_records drives the balancing; n_targets and n_cavities are informational)
                        drug_sample_counts = combined_results.filter(
                            pl.col('sample_type').is_not_null()
                        ).group_by(['drugbank_id', 'sample_type']).agg([
                            pl.col('uniprot_id').n_unique().alias('n_targets'),
                            pl.col('cavity_index').n_unique().alias('n_cavities'),
                            pl.len().alias('n_records')
                        ])
                        
                        # For each drug, check if it has both positive and negative samples
                        drugs_with_both = {}
                        
                        for drug in combined_results['drugbank_id'].unique().to_list():
                            drug_samples = drug_sample_counts.filter(
                                pl.col('drugbank_id') == drug
                            )
                            
                            pos_count = drug_samples.filter(
                                pl.col('sample_type') == 'positive'
                            )
                            neg_count = drug_samples.filter(
                                pl.col('sample_type').str.contains('negative')
                            )
                            
                            has_pos = pos_count.height > 0
                            has_neg = neg_count.height > 0
                            
                            if has_pos and has_neg:
                                # Sum the row counts (n_records) across the matching sample types
                                n_pos = pos_count.select('n_records').sum().item() if has_pos else 0
                                n_neg = neg_count.select('n_records').sum().item() if has_neg else 0
                                
                                drugs_with_both[drug] = {
                                    'n_positive': n_pos,
                                    'n_negative': n_neg,
                                    'min_count': min(n_pos, n_neg)
                                }
                        
                        print(f"   Drugs with both sample types: {len(drugs_with_both):,}")
                        print(f"   Drugs filtered out (only one sample type): {combined_results['drugbank_id'].n_unique() - len(drugs_with_both):,}")
                        
                        if len(drugs_with_both) > 0:
                            # Filter for drugs with both sample types
                            valid_drugs = list(drugs_with_both.keys())
                            combined_results = combined_results.filter(
                                pl.col('drugbank_id').is_in(valid_drugs)
                            )
                            
                            after_drug_filter = combined_results.height
                            print(f"   Rows after drug filtering: {after_drug_filter:,}")
                            
                            # Now balance the samples for each drug
                            print(f"\n   Balancing positive/negative samples for each drug...")
                            
                            balanced_chunks = []
                            total_pos_kept = 0
                            total_neg_kept = 0
                            
                            for drug, counts in drugs_with_both.items():
                                min_count = counts['min_count']
                                
                                # Get positive samples for this drug
                                pos_samples = combined_results.filter(
                                    (pl.col('drugbank_id') == drug) &
                                    (pl.col('sample_type') == 'positive')
                                )
                                
                                # Get negative samples for this drug
                                neg_samples = combined_results.filter(
                                    (pl.col('drugbank_id') == drug) &
                                    (pl.col('sample_type').str.contains('negative'))
                                )
                                
                                # Take equal number from each (limited by minimum)
                                if pos_samples.height > min_count:
                                    pos_samples = pos_samples.sample(n=min_count, seed=42)
                                if neg_samples.height > min_count:
                                    neg_samples = neg_samples.sample(n=min_count, seed=42)
                                
                                balanced_chunks.append(pos_samples)
                                balanced_chunks.append(neg_samples)
                                
                                total_pos_kept += pos_samples.height
                                total_neg_kept += neg_samples.height
                            
                            # Combine balanced chunks
                            if balanced_chunks:
                                combined_results = pl.concat(balanced_chunks, how='vertical')
                                balanced_rows = combined_results.height
                                
                                print(f"   ✅ Balanced sampling complete:")
                                print(f"      Positive samples kept: {total_pos_kept:,}")
                                print(f"      Negative samples kept: {total_neg_kept:,}")
                                print(f"      Total rows: {balanced_rows:,}")
                                print(f"      Balance ratio: {total_pos_kept/total_neg_kept:.3f}" if total_neg_kept > 0 else "      Balance ratio: N/A")
                                
                                # Verify final balance
                                final_sample_counts = combined_results.group_by('sample_type').agg(pl.len()).sort('sample_type')
                                print(f"\n   📊 Final sample type distribution:")
                                for row in final_sample_counts.iter_rows(named=True):
                                    print(f"      {row['sample_type']}: {row['len']:,}")
                            else:
                                print(f"   ⚠️ No balanced data could be created")
                        else:
                            print(f"   ⚠️ No drugs have both positive and negative samples")
                            print(f"      Skipping balanced sampling")
                    elif has_positive:
                        print(f"   ⚠️ Only positive samples found - cannot balance")
                    elif has_negative:
                        print(f"   ⚠️ Only negative samples found - cannot balance")
                    else:
                        print(f"   ⚠️ No valid sample type information found")
                else:
                    print(f"   ⚠️ No sample_type column found - skipping balanced sampling")
                    print(f"      Run sample type annotation (Step 2.5) first for balanced sampling")
                
            else:
                print("❌ No drug-target pairs have complete tool coverage!")
                print("   Cannot proceed with fair comparison analysis")
        else:
            print(f"❌ Not enough tools found ({len(available_expected_tools)} < 2)")
    else:
        print("❌ Tool1 or Tool2 columns not found")
        
    print(f"\n🎯 Ready for analysis with shape: {combined_results.shape}")
    
else:
    print("❌ No data available for filtering")
    print("   Please load data first (Step 1)")

## 📊 Step 3: Data Overview & Quality Check

Get familiar with the **filtered** dataset structure and check data quality. This step analyzes the data after filtering for complete tool coverage and balanced sampling (Step 2.6), so all statistics reflect the dataset actually used for analysis.

In [None]:
# =============================================================================
# 📊 STEP 3: DATA OVERVIEW & QUALITY CHECK
# =============================================================================

if not combined_results.is_empty():
    print("🔍 DATASET OVERVIEW")
    print("=" * 50)
    print(f"📊 Shape: {combined_results.shape} (rows × columns)")
    print(f"💾 Memory: {combined_results.estimated_size() / (1024*1024):.1f} MB")
    print(f"📋 Columns: {combined_results.width}")
    
    print(f"\n📚 Column Names:")
    for i, col in enumerate(combined_results.columns, 1):
        print(f"  {i:2d}. {col}")
    
    print(f"\n🧬 KEY DATASET STATISTICS")
    print("=" * 50)
    
    # Core identifiers
    print(f"🔬 Unique Drug-Target Combinations: {combined_results.select(['drugbank_id', 'uniprot_id']).unique().height:,}")
    print(f"💊 Unique Drugs (DrugBank IDs): {combined_results['drugbank_id'].n_unique():,}")
    print(f"🧬 Unique Proteins (UniProt IDs): {combined_results['uniprot_id'].n_unique():,}")
    
    # Check for cavity index information
    if 'extracted_cavity_index' in combined_results.columns:
        print(f"🕳️  Unique Cavities: {combined_results['extracted_cavity_index'].n_unique():,}")
    elif 'cavity_index' in combined_results.columns:
        print(f"🕳️  Unique Cavities: {combined_results['cavity_index'].n_unique():,}")
    else:
        print("🕳️  Cavity information: Not available")
    
    # Cluster information
    if 'cavity_cluster_id' in combined_results.columns:
        cluster_mapped = combined_results['cavity_cluster_id'].drop_nulls().len()
        cluster_total = combined_results.height
        # Drop nulls first so unmapped rows are not counted as a cluster
        unique_clusters = combined_results['cavity_cluster_id'].drop_nulls().n_unique()
        print(f"🧩 Cavity Clusters: {unique_clusters:,} unique clusters")
        print(f"   Mapped: {cluster_mapped:,}/{cluster_total:,} ({cluster_mapped/cluster_total*100:.1f}%)")
    
    # Check for essential analysis columns
    print(f"\n🔍 DATA QUALITY ASSESSMENT")
    print("=" * 50)
    
    # Check for RMSD columns
    rmsd_columns = [col for col in combined_results.columns if 'rmsd' in col.lower()]
    if rmsd_columns:
        rmsd_col = rmsd_columns[0]
        print(f"✅ RMSD data available: {rmsd_col}")
        
        # RMSD statistics
        rmsd_stats = combined_results.select([
            pl.col(rmsd_col).min().alias('min_rmsd'),
            pl.col(rmsd_col).max().alias('max_rmsd'),
            pl.col(rmsd_col).mean().alias('mean_rmsd'),
            pl.col(rmsd_col).median().alias('median_rmsd'),
            (pl.col(rmsd_col) < 2.0).mean().alias('good_poses_pct')
        ]).to_pandas().iloc[0]
        
        print(f"   Range: {rmsd_stats['min_rmsd']:.2f} - {rmsd_stats['max_rmsd']:.2f} Å")
        print(f"   Mean: {rmsd_stats['mean_rmsd']:.2f} Å, Median: {rmsd_stats['median_rmsd']:.2f} Å")
        print(f"   Good poses (RMSD < 2.0 Å): {rmsd_stats['good_poses_pct']*100:.1f}%")
    else:
        print("⚠️  No RMSD columns found - pose consistency analysis may be limited")
    
    # Check for score columns
    score_columns = [col for col in combined_results.columns if 'score' in col.lower()]
    print(f"✅ Score columns available: {len(score_columns)}")
    for col in score_columns[:5]:  # Show first 5 score columns
        print(f"   - {col}")
    if len(score_columns) > 5:
        print(f"   ... and {len(score_columns) - 5} more")
    
    # Tool information
    if 'Tool1' in combined_results.columns and 'Tool2' in combined_results.columns:
        tool1_unique = combined_results.filter(pl.col('Tool1').is_not_null())['Tool1'].n_unique()
        tool2_unique = combined_results.filter(pl.col('Tool2').is_not_null())['Tool2'].n_unique()
        print(f"🔧 Tool1 variants: {tool1_unique}")
        print(f"🔧 Tool2 variants: {tool2_unique}")
    
    # Source information
    if 'source_type' in combined_results.columns:
        source_types = combined_results['source_type'].value_counts().to_pandas()
        print(f"📁 Source types:")
        for _, row in source_types.iterrows():
            print(f"   - {row['source_type']}: {row['count']:,} rows")
    
    print(f"\n✅ Dataset quality check complete")
    print(f"🎯 Ready for consensus analysis!")
    
else:
    print("❌ No data available for overview")
    print("   Please load and filter data first")

## 💾 Step 4: Save Prepared Data for Analysis

**Purpose:** Save the filtered, annotated, and quality-checked dataset for later use.

This checkpoint preserves the processed data after:
- ✅ Data loading and cluster integration (Steps 1-2)
- ✅ Sample type annotation (Step 2.5)
- ✅ Complete tool coverage filtering (Step 2.6)
- ✅ Balanced positive/negative sampling (Step 2.6)
- ✅ Data quality verification (Step 3)

**Output:** A clean, analysis-ready Parquet file containing:
- All consensus docking results with complete tool coverage
- Balanced positive and negative samples
- Annotated sample types and metadata
- Integrated cavity cluster information

This file can be reloaded in future sessions to skip preprocessing steps and jump directly to analysis.

In [None]:
combined_results.write_parquet("/media/onur/Elements/cavity_space_consensus_docking/2025_06_29_batch_dock/combined_filtered_annotated_docking_results.parquet")