# Tutorial 2: Batch Processing and Variant Comparison

This notebook demonstrates how to analyze multiple SDF files in batch mode and compare results across different R-group variants.

## Learning Objectives

By the end of this tutorial, you will be able to:
1. Process multiple files in batch mode
2. Use parallel processing for faster analysis
3. Compare geometric properties across variants
4. Generate comparative visualizations

## 1. Setup

In [None]:
from pyrene_analyzer import PyreneDimerAnalyzer
from pyrene_analyzer.visualization import (
    plot_variant_comparison,
    plot_distance_vs_overlap
)

import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

%matplotlib inline

## 2. Define Input Files

Replace these paths with your actual SDF files for different R-group variants.

In [None]:
# Example: List of files for different R-group variants
# Replace with your actual file paths
input_files = [
    'Et_conformers.sdf',
    'iPr_conformers.sdf',
    'cHex_conformers.sdf',
    'tBu_conformers.sdf',
]

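# Keep only the files that exist, and report any that are missing
existing_files = [f for f in input_files if Path(f).exists()]
missing_files = [f for f in input_files if f not in existing_files]
print(f"Found {len(existing_files)} of {len(input_files)} files")
if missing_files:
    print(f"Missing: {', '.join(missing_files)}")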
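
Before running the batch analysis, it can help to check how many conformers each file contains. The sketch below is not part of `pyrene_analyzer`; it relies only on the SDF convention that every record ends with a `$$$$` line. The helper names `count_sdf_records` and `summarize_inputs` are hypothetical.

```python
from pathlib import Path


def count_sdf_records(path):
    """Count records in an SDF file by counting '$$$$' delimiter lines."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip() == "$$$$")


def summarize_inputs(paths):
    """Map each existing file to its record count; missing files are skipped."""
    return {str(p): count_sdf_records(p) for p in paths if Path(p).exists()}
```

Calling `summarize_inputs(existing_files)` would return a dictionary of file names and record counts, which is a quick way to spot an empty or truncated file before a long run.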

## 3. Batch Analysis

In [None]:
# Initialize analyzer
analyzer = PyreneDimerAnalyzer(verbose=True)

# Run batch analysis with parallel processing
# n_jobs=-1 uses all available CPU cores
if existing_files:
    results_df = analyzer.batch_analyze(existing_files, n_jobs=-1)
    print(f"\nTotal conformers analyzed: {len(results_df)}")
else:
    # No input files found: generate random placeholder data so the rest
    # of the notebook still runs. These values are NOT real results.
    import numpy as np
    np.random.seed(42)
    
    dfs = []
    for variant in ['Et', 'iPr', 'cHex', 'tBu']:
        n = 50
        df = pd.DataFrame({
            'molecule': [variant] * n,
            'conformer_id': range(n),
            'plane_angle_deg': np.random.uniform(10, 80, n),
            'interplane_distance_A': np.random.uniform(3.2, 5.0, n),
            'pi_overlap_pct': np.random.uniform(10, 90, n),
        })
        dfs.append(df)
    results_df = pd.concat(dfs, ignore_index=True)
    print("Using sample data for demonstration")
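
The `n_jobs=-1` convention used above (all available cores) appears in several Python libraries. How `batch_analyze` distributes work internally is not shown in this tutorial, so the sketch below only illustrates the general pattern with the standard library. `resolve_n_jobs`, `parallel_map`, and `fake_analysis` are hypothetical names, and threads are used purely to keep the example self-contained.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_n_jobs(n_jobs):
    """Translate an n_jobs value into a worker count (-1 means all cores)."""
    cpu_total = os.cpu_count() or 1
    if n_jobs == -1:
        return cpu_total
    if n_jobs < 1:
        raise ValueError("n_jobs must be -1 or a positive integer")
    # Never request more workers than there are cores
    return min(n_jobs, cpu_total)


def parallel_map(func, items, n_jobs=-1):
    """Apply func to every item with a pool of workers, preserving input order."""
    workers = resolve_n_jobs(n_jobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def fake_analysis(filename):
    """Hypothetical stand-in for a per-file analysis step."""
    return filename, len(filename)


results = parallel_map(fake_analysis, ["Et_conformers.sdf", "tBu_conformers.sdf"])
print(results)
```

For CPU-bound work such as geometry calculations, a process pool (`ProcessPoolExecutor`) is usually a better fit than threads, at the cost of requiring picklable functions and arguments.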

## 4. Summary Statistics by Variant

In [None]:
# Group by molecule and calculate statistics
summary = results_df.groupby('molecule').agg({
    'plane_angle_deg': ['mean', 'std', 'min', 'max'],
    'interplane_distance_A': ['mean', 'std', 'min', 'max'],
    'pi_overlap_pct': ['mean', 'std', 'min', 'max'],
})

summary
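
The `agg` call above returns a table with two-level column labels such as `('plane_angle_deg', 'mean')`, which can be awkward to export or merge. A minimal sketch of flattening them, using a small made-up table in place of `results_df`:

```python
import pandas as pd

# Tiny made-up table standing in for results_df
demo_df = pd.DataFrame({
    "molecule": ["Et", "Et", "tBu", "tBu"],
    "plane_angle_deg": [10.0, 20.0, 40.0, 60.0],
    "interplane_distance_A": [3.4, 3.6, 4.0, 4.4],
})

demo_summary = demo_df.groupby("molecule").agg({
    "plane_angle_deg": ["mean", "std"],
    "interplane_distance_A": ["mean", "std"],
})

# Join the two column levels into single names, e.g. 'plane_angle_deg_mean'
flat_summary = demo_summary.copy()
flat_summary.columns = ["_".join(col) for col in demo_summary.columns]
flat_summary = flat_summary.reset_index()

print(flat_summary.columns.tolist())
```

The flattened table has one row per variant and plain string column names, so it can be written out with `to_csv` without further handling.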

## 5. Variant Comparison Plots

In [None]:
# Compare plane angles across variants
fig = plot_variant_comparison(
    results_df,
    parameter='plane_angle_deg'
)
plt.show()

In [None]:
# Compare inter-plane distances
fig = plot_variant_comparison(
    results_df,
    parameter='interplane_distance_A'
)
plt.show()

In [None]:
# Distance vs Overlap colored by variant
fig = plot_distance_vs_overlap(
    results_df,
    color_by='molecule'
)
plt.show()
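
A scatter plot shows the distance/overlap trend visually; a per-variant correlation coefficient puts a number on it. The sketch below uses plain pandas on a small made-up table (not real results) and assumes the column names used earlier in this notebook.

```python
import pandas as pd

# Made-up values for illustration only
demo_df = pd.DataFrame({
    "molecule": ["Et"] * 4 + ["tBu"] * 4,
    "interplane_distance_A": [3.3, 3.6, 4.0, 4.5, 3.4, 3.8, 4.2, 4.8],
    "pi_overlap_pct": [80.0, 65.0, 40.0, 20.0, 30.0, 35.0, 25.0, 30.0],
})

# Pearson correlation between distance and overlap, computed per variant
correlations = pd.Series(
    {
        name: group["interplane_distance_A"].corr(group["pi_overlap_pct"])
        for name, group in demo_df.groupby("molecule")
    },
    name="pearson_r",
)
print(correlations.round(2))
```

A strongly negative value for a variant would indicate that overlap falls as the planes move apart; values near zero suggest little linear relationship.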

## 6. Classification Analysis

In [None]:
# Add classification
results_df = analyzer.add_classification(results_df)

# Cross-tabulation of variant vs classification
pd.crosstab(
    results_df['molecule'],
    results_df['classification'],
    normalize='index'
) * 100  # Convert to percentages
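
To make the effect of `normalize='index'` concrete, here is a small sketch with made-up labels. The class names `stacked` and `twisted` are placeholders, not necessarily the categories produced by `add_classification`.

```python
import pandas as pd

# Made-up classifications for illustration only
toy = pd.DataFrame({
    "molecule": ["Et", "Et", "Et", "Et", "tBu", "tBu"],
    "classification": ["stacked", "stacked", "stacked", "twisted",
                       "twisted", "twisted"],
})

# normalize='index' divides each row by its own total,
# so every variant's percentages sum to 100
pct = pd.crosstab(toy["molecule"], toy["classification"], normalize="index") * 100
print(pct)
```

Because each row is normalized separately, variants with different numbers of conformers can be compared directly.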

## Summary

In this tutorial, we learned how to:
- Process multiple files in batch mode, using `n_jobs` for parallel processing
- Compare geometric properties across R-group variants
- Generate comparative visualizations
- Analyze classification distributions

The next tutorial will cover advanced visualization and QSAR workflows.