# BACH Dataset - Exploratory Data Analysis

This notebook explores the BACH (Breast Cancer Histology) dataset: class balance, image properties, color and texture statistics, data quality, and a comparison with the BreakHis dataset.

## Dataset Overview
- **BACH**: Breast Cancer Histology Challenge dataset
- **Classes**: 4 (Normal, Benign, In Situ Carcinoma, Invasive Carcinoma)
- **Resolution**: High-resolution (2048 x 1536 pixels)
- **Total Images**: 400 (100 per class)
- **Source**: ICIAR 2018 Challenge

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.append('../src')
from bach_data_utils import create_bach_metadata

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Constants
BACH_ROOT = '../data/bach'
FIGSIZE = (12, 8)

## Data Setup

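First, check whether the BACH data is available locally. If it is not, the cell below prints download options and the expected folder layout; it does not download anything itself.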
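A quick structural check can complement the existence test. The following is a minimal standard-library sketch, not part of the project's `bach_data_utils` module: the helper name `missing_class_dirs` is hypothetical, and the folder names are taken from the expected structure printed in the next cell.

```python
from pathlib import Path

# Folder names follow the expected structure listed in this notebook.
EXPECTED_CLASSES = ("Normal", "Benign", "InSitu", "Invasive")

def missing_class_dirs(root, expected=EXPECTED_CLASSES):
    """Return the expected class folders that are absent under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_class_dirs('../data/bach')` should return an empty list when all four class folders are in place, and the names of the missing folders otherwise.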

In [None]:
# Check if BACH data exists
if not os.path.exists(BACH_ROOT):
    print("BACH dataset not found. Please download it first.")
    print("\nDownload options:")
    print("1. Official: https://iciar2018-challenge.grand-challenge.org/Dataset/")
    print("2. Kaggle: https://www.kaggle.com/datasets/paultimothymooney/breast-histopathology-images")
    print("\nExpected structure:")
    print("data/bach/")
    print("├── Normal/")
    print("├── Benign/")
    print("├── InSitu/")
    print("└── Invasive/")
else:
    print(f"BACH dataset found at: {BACH_ROOT}")
    
    # List directory structure
    for root, dirs, files in os.walk(BACH_ROOT):
        level = root.replace(BACH_ROOT, '').count(os.sep)
        indent = ' ' * 2 * level
        print(f"{indent}{os.path.basename(root)}/")
        subindent = ' ' * 2 * (level + 1)
        for file in files[:3]:  # Show first 3 files
            print(f"{subindent}{file}")
        if len(files) > 3:
            print(f"{subindent}... and {len(files)-3} more files")

## 1. Dataset Loading and Basic Statistics

In [None]:
# Load BACH metadata
if os.path.exists(BACH_ROOT):
    bach_df = create_bach_metadata(BACH_ROOT)
    print(f"Total images: {len(bach_df)}")
    print(f"\nDataset info:")
    print(bach_df.info())
    print(f"\nFirst few rows:")
    display(bach_df.head())
else:
    print("Please download BACH dataset first to proceed with EDA.")

## 2. Class Distribution Analysis
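This section reports an imbalance ratio: the size of the largest class divided by the size of the smallest. As a minimal standalone sketch of that calculation, the snippet below uses `collections.Counter` on toy labels rather than the notebook's `bach_df`; the function name `imbalance_ratio` is illustrative.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest (1.0 means balanced)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    return max(counts.values()) / min(counts.values())

# Toy labels, not real BACH data: 100 images in each of the four classes.
balanced = ["Normal"] * 100 + ["Benign"] * 100 + ["InSitu"] * 100 + ["Invasive"] * 100
print(imbalance_ratio(balanced))  # 1.0
```

A ratio of 1.0 is what the dataset overview (100 images per class) would lead one to expect; a larger value in the cell below would suggest that some files are missing or mislabeled.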

In [None]:
if os.path.exists(BACH_ROOT):
    # Class distribution
    class_counts = bach_df['class'].value_counts()
    print("Class Distribution:")
    print(class_counts)
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # Bar plot
    class_counts.plot(kind='bar', ax=axes[0], color='skyblue')
    axes[0].set_title('BACH Dataset - Class Distribution')
    axes[0].set_xlabel('Class')
    axes[0].set_ylabel('Number of Images')
    axes[0].tick_params(axis='x', rotation=45)
    
    # Pie chart
    axes[1].pie(class_counts.values, labels=class_counts.index, autopct='%1.1f%%', startangle=90)
    axes[1].set_title('BACH Dataset - Class Proportions')
    
    plt.tight_layout()
    plt.show()
    
    # Class balance analysis
    print(f"\nClass Balance Analysis:")
    print(f"Most common class: {class_counts.index[0]} ({class_counts.iloc[0]} images)")
    print(f"Least common class: {class_counts.index[-1]} ({class_counts.iloc[-1]} images)")
    print(f"Imbalance ratio: {class_counts.iloc[0] / class_counts.iloc[-1]:.2f}:1")

## 3. Image Properties Analysis
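File size on disk can understate how much memory a decoded image needs. The arithmetic below is a rough sketch that assumes 8-bit RGB pixels (an assumption; the color mode actually found is reported by the cell that follows) and uses the 2048 x 1536 resolution from the dataset overview. The function name `decoded_size_mib` is illustrative.

```python
def decoded_size_mib(width, height, channels=3, bytes_per_channel=1):
    """Size in MiB of an uncompressed image held in memory."""
    return width * height * channels * bytes_per_channel / (1024 * 1024)

# Assumes 8-bit RGB at the resolution listed in the dataset overview.
per_image = decoded_size_mib(2048, 1536)
print(f"One image: {per_image:.1f} MiB")                     # One image: 9.0 MiB
print(f"All 400 images: {per_image * 400 / 1024:.2f} GiB")   # All 400 images: 3.52 GiB
```

This is one reason the later color and texture cells resize each image to 224 x 224 before computing statistics.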

In [None]:
if os.path.exists(BACH_ROOT):
    # Analyze image properties
    def analyze_image_properties(df, sample_size=50):
        """Analyze image dimensions, file sizes, and formats"""
        properties = []
        
        # Sample images for analysis
        sample_df = df.sample(min(sample_size, len(df)), random_state=42)
        
        for _, row in sample_df.iterrows():
            img_path = row['path']
            try:
                # The context manager closes the file handle once the header is read
                with Image.open(img_path) as img:
                    properties.append({
                        'class': row['class'],
                        'width': img.width,
                        'height': img.height,
                        'aspect_ratio': img.width / img.height,
                        'file_size_mb': os.path.getsize(img_path) / (1024 * 1024),
                        'format': img.format,
                        'mode': img.mode
                    })
            except Exception as e:
                print(f"Error processing {img_path}: {e}")
        
        return pd.DataFrame(properties)
    
    # Analyze properties
    props_df = analyze_image_properties(bach_df)
    
    print("Image Properties Summary:")
    print(props_df.describe())
    
    # Visualize dimensions
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Width distribution
    props_df['width'].hist(bins=20, ax=axes[0,0], alpha=0.7)
    axes[0,0].set_title('Image Width Distribution')
    axes[0,0].set_xlabel('Width (pixels)')
    
    # Height distribution
    props_df['height'].hist(bins=20, ax=axes[0,1], alpha=0.7)
    axes[0,1].set_title('Image Height Distribution')
    axes[0,1].set_xlabel('Height (pixels)')
    
    # Aspect ratio by class
    sns.boxplot(data=props_df, x='class', y='aspect_ratio', ax=axes[1,0])
    axes[1,0].set_title('Aspect Ratio by Class')
    axes[1,0].tick_params(axis='x', rotation=45)
    
    # File size distribution
    props_df['file_size_mb'].hist(bins=20, ax=axes[1,1], alpha=0.7)
    axes[1,1].set_title('File Size Distribution')
    axes[1,1].set_xlabel('File Size (MB)')
    
    plt.tight_layout()
    plt.show()
    
    # Format and mode analysis
    print(f"\nImage Formats: {props_df['format'].value_counts().to_dict()}")
    print(f"Color Modes: {props_df['mode'].value_counts().to_dict()}")

## 4. Sample Images Visualization

In [None]:
if os.path.exists(BACH_ROOT):
    def display_sample_images(df, samples_per_class=2, figsize=(16, 12)):
        """Display sample images from each class"""
        classes = df['class'].unique()
        n_classes = len(classes)
        
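        # squeeze=False keeps `axes` 2-D even with a single class or one sample per class
        fig, axes = plt.subplots(n_classes, samples_per_class, figsize=figsize, squeeze=False)
        for ax in axes.ravel():
            ax.axis('off')  # hide every cell up front so unused ones stay blank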
        
        for i, class_name in enumerate(classes):
            class_df = df[df['class'] == class_name]
            samples = class_df.sample(min(samples_per_class, len(class_df)), random_state=42)
            
            for j, (_, row) in enumerate(samples.iterrows()):
                if j >= samples_per_class:
                    break
                    
                try:
                    # Open inside a context manager so the file handle is released
                    with Image.open(row['path']) as img:
                        # Resize for display
                        img_resized = img.resize((400, 300), Image.Resampling.LANCZOS)
                    
                    axes[i, j].imshow(img_resized)
                    axes[i, j].set_title(f"{class_name}\n{os.path.basename(row['path'])}")
                    axes[i, j].axis('off')
                except Exception as e:
                    axes[i, j].text(0.5, 0.5, f'Error loading\n{e}', 
                                  ha='center', va='center', transform=axes[i, j].transAxes)
                    axes[i, j].axis('off')
        
        plt.tight_layout()
        plt.show()
    
    print("Sample Images from Each Class:")
    display_sample_images(bach_df)

## 5. Color Analysis
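The statistics in this section are per-channel means and standard deviations over all pixels of an image. As a small illustration of what those numbers mean, the sketch below computes them with NumPy on a synthetic 4 x 4 array instead of a BACH image; the function name `channel_stats` is illustrative.

```python
import numpy as np

def channel_stats(img_array):
    """Per-channel mean and standard deviation of an H x W x 3 array."""
    pixels = img_array.reshape(-1, 3).astype(np.float64)
    return pixels.mean(axis=0), pixels.std(axis=0)

# Synthetic image, not BACH data: two flat-colored halves.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2] = (200, 100, 150)   # left half
img[:, 2:] = (100, 50, 150)    # right half

means, stds = channel_stats(img)
# Channel means are 150 (R), 75 (G), 150 (B); standard deviations are 50, 25, 0.
print(means, stds)
```

A channel that is constant across the image, like blue here, has zero standard deviation, which is why the per-channel spread is read as a rough measure of color variation.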

In [None]:
if os.path.exists(BACH_ROOT):
    def analyze_color_properties(df, sample_size=20):
        """Analyze color properties of images"""
        color_stats = []
        
        # Sample the same number of images per class (at least one each)
        per_class = max(1, sample_size // df['class'].nunique())
        sample_df = pd.concat(
            [g.sample(min(per_class, len(g)), random_state=42) for _, g in df.groupby('class')],
            ignore_index=True
        )
        
        for _, row in sample_df.iterrows():
            try:
                img = Image.open(row['path']).convert('RGB')
                # Resize for faster processing
                img_small = img.resize((224, 224), Image.Resampling.LANCZOS)
                img_array = np.array(img_small)
                
                # Calculate color statistics
                color_stats.append({
                    'class': row['class'],
                    'mean_r': np.mean(img_array[:,:,0]),
                    'mean_g': np.mean(img_array[:,:,1]),
                    'mean_b': np.mean(img_array[:,:,2]),
                    'std_r': np.std(img_array[:,:,0]),
                    'std_g': np.std(img_array[:,:,1]),
                    'std_b': np.std(img_array[:,:,2]),
                    'brightness': np.mean(img_array),
                    'contrast': np.std(img_array)
                })
            except Exception as e:
                print(f"Error processing {row['path']}: {e}")
        
        return pd.DataFrame(color_stats)
    
    # Analyze colors
    color_df = analyze_color_properties(bach_df)
    
    if not color_df.empty:
        # Visualize color properties
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # RGB means by class
        rgb_means = color_df.groupby('class')[['mean_r', 'mean_g', 'mean_b']].mean()
        rgb_means.plot(kind='bar', ax=axes[0,0])
        axes[0,0].set_title('Average RGB Values by Class')
        axes[0,0].set_ylabel('Mean Pixel Value')
        axes[0,0].tick_params(axis='x', rotation=45)
        axes[0,0].legend(['Red', 'Green', 'Blue'])
        
        # Brightness distribution
        sns.boxplot(data=color_df, x='class', y='brightness', ax=axes[0,1])
        axes[0,1].set_title('Brightness Distribution by Class')
        axes[0,1].tick_params(axis='x', rotation=45)
        
        # Contrast distribution
        sns.boxplot(data=color_df, x='class', y='contrast', ax=axes[1,0])
        axes[1,0].set_title('Contrast Distribution by Class')
        axes[1,0].tick_params(axis='x', rotation=45)
        
        # Color variance
        color_variance = color_df.groupby('class')[['std_r', 'std_g', 'std_b']].mean()
        color_variance.plot(kind='bar', ax=axes[1,1])
        axes[1,1].set_title('Color Variance by Class')
        axes[1,1].set_ylabel('Standard Deviation')
        axes[1,1].tick_params(axis='x', rotation=45)
        axes[1,1].legend(['Red Std', 'Green Std', 'Blue Std'])
        
        plt.tight_layout()
        plt.show()
        
        # Summary statistics
        print("\nColor Statistics Summary:")
        print(color_df.groupby('class')[['brightness', 'contrast']].describe())

## 6. Texture Analysis
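The sharpness measure used in this section is the variance of the image Laplacian: flat regions give values near zero, while fine detail gives large values. The NumPy-only sketch below illustrates the idea with the 4-neighbor kernel that `cv2.Laplacian` is documented to use at its default aperture size. It skips border pixels, so it is an approximation of the OpenCV result rather than a replacement, and the function name `laplacian_variance` is illustrative.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of the 4-neighbor Laplacian over interior pixels.

    Kernel: [[0, 1, 0], [1, -4, 1], [0, 1, 0]]. Border pixels are skipped,
    so values can differ slightly from cv2.Laplacian, which pads the image.
    """
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

# Synthetic examples, not BACH data.
flat = np.full((8, 8), 128, dtype=np.uint8)             # no detail at all
checker = (np.indices((8, 8)).sum(axis=0) % 2) * 255    # alternating pixels

print(laplacian_variance(flat))         # 0.0
print(laplacian_variance(checker) > 0)  # True
```

Because the cell below resizes each image to 224 x 224 before applying the filter, its values describe the downsampled image and should be compared only across images processed the same way.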

In [None]:
if os.path.exists(BACH_ROOT):
    def analyze_texture_features(df, sample_size=10):
        """Analyze texture features using basic image processing"""
        texture_stats = []
        
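        # Sample the same number of images per class (at least one each).
        # With sample_size=10 and four classes this is only 2 images per class,
        # so raise sample_size for more stable box plots.
        per_class = max(1, sample_size // df['class'].nunique())
        sample_df = pd.concat(
            [g.sample(min(per_class, len(g)), random_state=42) for _, g in df.groupby('class')],
            ignore_index=True
        )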
        
        for _, row in sample_df.iterrows():
            try:
                img = cv2.imread(row['path'])
                if img is None:
                    continue
                    
                # Convert to grayscale
                gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
                # Resize for faster processing
                gray = cv2.resize(gray, (224, 224))
                
                # Calculate texture features
                # Laplacian variance (measure of focus/sharpness)
                laplacian_var = cv2.Laplacian(gray, cv2.CV_64F).var()
                
                # Sobel gradients (edge information)
                sobelx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
                sobely = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
                sobel_magnitude = np.sqrt(sobelx**2 + sobely**2)
                
                texture_stats.append({
                    'class': row['class'],
                    'laplacian_variance': laplacian_var,
                    'sobel_mean': np.mean(sobel_magnitude),
                    'sobel_std': np.std(sobel_magnitude),
                    'intensity_std': np.std(gray),
                    'intensity_range': np.ptp(gray)
                })
            except Exception as e:
                print(f"Error processing {row['path']}: {e}")
        
        return pd.DataFrame(texture_stats)
    
    # Analyze texture
    texture_df = analyze_texture_features(bach_df)
    
    if not texture_df.empty:
        # Visualize texture properties
        fig, axes = plt.subplots(2, 2, figsize=(15, 12))
        
        # Laplacian variance (sharpness)
        sns.boxplot(data=texture_df, x='class', y='laplacian_variance', ax=axes[0,0])
        axes[0,0].set_title('Image Sharpness (Laplacian Variance)')
        axes[0,0].tick_params(axis='x', rotation=45)
        
        # Edge strength
        sns.boxplot(data=texture_df, x='class', y='sobel_mean', ax=axes[0,1])
        axes[0,1].set_title('Edge Strength (Sobel Mean)')
        axes[0,1].tick_params(axis='x', rotation=45)
        
        # Intensity variation
        sns.boxplot(data=texture_df, x='class', y='intensity_std', ax=axes[1,0])
        axes[1,0].set_title('Intensity Variation')
        axes[1,0].tick_params(axis='x', rotation=45)
        
        # Intensity range
        sns.boxplot(data=texture_df, x='class', y='intensity_range', ax=axes[1,1])
        axes[1,1].set_title('Intensity Range')
        axes[1,1].tick_params(axis='x', rotation=45)
        
        plt.tight_layout()
        plt.show()
        
        # Summary statistics
        print("\nTexture Statistics Summary:")
        print(texture_df.groupby('class').describe())

## 7. Data Quality Assessment
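The assessment in this section flags duplicate filenames, which does not catch the same image stored under two different names. As a possible complement (not something the notebook does itself), the sketch below groups files by the SHA-256 digest of their bytes using only the standard library; the function name `find_duplicate_files` is hypothetical.

```python
import hashlib
from collections import defaultdict

def find_duplicate_files(paths, chunk_size=1 << 20):
    """Group file paths whose bytes are identical (same SHA-256 digest)."""
    by_digest = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            # Read in chunks so large images are never loaded whole
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
        by_digest[digest.hexdigest()].append(path)
    # Keep only digests shared by more than one file
    return [group for group in by_digest.values() if len(group) > 1]
```

Assuming the metadata frame has a `path` column as used elsewhere in this notebook, `find_duplicate_files(bach_df['path'])` would return a list of groups, each holding paths with byte-identical content. It does not detect near-duplicates such as re-encoded or cropped copies.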

In [None]:
if os.path.exists(BACH_ROOT):
    def assess_data_quality(df):
        """Assess data quality issues"""
        issues = []
        
        print("Data Quality Assessment:")
        print("=" * 50)
        
        # Check for missing files
        missing_files = 0
        corrupted_files = 0
        
        for _, row in df.iterrows():
            img_path = row['path']
            
            # Check if file exists
            if not os.path.exists(img_path):
                missing_files += 1
                issues.append(f"Missing file: {img_path}")
                continue
            
            # Check if file can be opened
            try:
                # Context manager releases the file handle after verification
                with Image.open(img_path) as img:
                    img.verify()  # Verify image integrity
            except Exception as e:
                corrupted_files += 1
                issues.append(f"Corrupted file: {img_path} - {e}")
        
        print(f"Total images: {len(df)}")
        print(f"Missing files: {missing_files}")
        print(f"Corrupted files: {corrupted_files}")
        print(f"Valid images: {len(df) - missing_files - corrupted_files}")
        
        # Check class balance
        class_counts = df['class'].value_counts()
        min_class_size = class_counts.min()
        max_class_size = class_counts.max()
        imbalance_ratio = max_class_size / min_class_size
        
        print(f"\nClass Balance:")
        print(f"Most common class: {max_class_size} images")
        print(f"Least common class: {min_class_size} images")
        print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
        
        if imbalance_ratio > 2.0:
            issues.append(f"Significant class imbalance detected: {imbalance_ratio:.2f}:1")
        
        # Check for duplicate filenames
        duplicate_names = df['filename'].duplicated().sum()
        if duplicate_names > 0:
            issues.append(f"Found {duplicate_names} duplicate filenames")
        
        print(f"\nData Quality Issues Found: {len(issues)}")
        for issue in issues[:10]:  # Show first 10 issues
            print(f"- {issue}")
        
        if len(issues) > 10:
            print(f"... and {len(issues) - 10} more issues")
        
        return issues
    
    # Assess data quality
    quality_issues = assess_data_quality(bach_df)

## 8. Comparison with BreakHis Dataset

In [None]:
if os.path.exists(BACH_ROOT):
    # Load BreakHis for comparison ('../src' is already on sys.path from the setup cell)
    from data_utils import create_metadata
    
    breakhis_root = '../data/breakhis/BreaKHis_v1/BreaKHis_v1/histology_slides/breast'
    
    if os.path.exists(breakhis_root):
        breakhis_df = create_metadata(breakhis_root)
        
        print("Dataset Comparison: BACH vs BreakHis")
        print("=" * 50)
        
        comparison_data = {
            'Metric': ['Total Images', 'Number of Classes', 'Avg Images per Class', 
                      'Min Class Size', 'Max Class Size', 'Imbalance Ratio'],
            'BACH': [
                len(bach_df),
                bach_df['class'].nunique(),
                len(bach_df) / bach_df['class'].nunique(),
                bach_df['class'].value_counts().min(),
                bach_df['class'].value_counts().max(),
                bach_df['class'].value_counts().max() / bach_df['class'].value_counts().min()
            ],
            'BreakHis': [
                len(breakhis_df),
                breakhis_df['subclass'].nunique(),
                len(breakhis_df) / breakhis_df['subclass'].nunique(),
                breakhis_df['subclass'].value_counts().min(),
                breakhis_df['subclass'].value_counts().max(),
                breakhis_df['subclass'].value_counts().max() / breakhis_df['subclass'].value_counts().min()
            ]
        }
        
        comparison_df = pd.DataFrame(comparison_data)
        print(comparison_df.to_string(index=False))
        
        # Visualize comparison
        fig, axes = plt.subplots(1, 2, figsize=(15, 6))
        
        # Dataset sizes
        datasets = ['BACH', 'BreakHis']
        sizes = [len(bach_df), len(breakhis_df)]
        axes[0].bar(datasets, sizes, color=['skyblue', 'lightcoral'])
        axes[0].set_title('Dataset Sizes Comparison')
        axes[0].set_ylabel('Number of Images')
        
        # Number of classes
        classes = [bach_df['class'].nunique(), breakhis_df['subclass'].nunique()]
        axes[1].bar(datasets, classes, color=['lightgreen', 'orange'])
        axes[1].set_title('Number of Classes Comparison')
        axes[1].set_ylabel('Number of Classes')
        
        plt.tight_layout()
        plt.show()
    else:
        print("BreakHis dataset not found for comparison")

## 9. Recommendations for Model Training
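One of the recommendations in this section is a stratified 70/15/15 split. As a minimal sketch of what stratification means here, the snippet below splits indices class by class with the standard library only. The function name `stratified_split` and the toy labels are illustrative; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would be a common choice.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that keep per-class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to test
    return train, val, test

# Toy labels shaped like BACH: 100 images in each of four classes.
labels = [c for c in ("Normal", "Benign", "InSitu", "Invasive") for _ in range(100)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 280 60 60
```

Each class contributes 70, 15, and 15 images to the three subsets, so the balance of the full dataset carries over to every split.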

In [None]:
if os.path.exists(BACH_ROOT):
    print("BACH Dataset - Training Recommendations")
    print("=" * 50)
    
    # Dataset characteristics
    total_images = len(bach_df)
    n_classes = bach_df['class'].nunique()
    class_counts = bach_df['class'].value_counts()
    imbalance_ratio = class_counts.max() / class_counts.min()
    
    print(f"Dataset Size: {total_images} images")
    print(f"Number of Classes: {n_classes}")
    print(f"Class Imbalance Ratio: {imbalance_ratio:.2f}:1")
    
    print("\nRecommendations:")
    print("-" * 30)
    
    # Data augmentation
    if total_images < 1000:
        print("✓ Use aggressive data augmentation (rotation, flipping, color jittering)")
        print("✓ Consider GAN-based augmentation for synthetic data generation")
    
    # Class imbalance
    if imbalance_ratio > 1.5:
        print("✓ Use weighted sampling or class weights to handle imbalance")
        print("✓ Consider focal loss for better handling of hard examples")
    
    # Model architecture
    print("✓ Use transfer learning with ImageNet pretrained models")
    print("✓ EfficientNet or ResNet architectures recommended for histopathology")
    
    # Training strategy
    print("✓ Use stratified splits to maintain class distribution")
    print("✓ Implement early stopping and learning rate scheduling")
    print("✓ Use cross-validation for robust performance estimation")
    
    # High-resolution considerations
    print("✓ BACH images are high-resolution - consider multi-scale training")
    print("✓ Use progressive resizing: start with smaller images, increase size")
    
    # Evaluation
    print("✓ Use multiple metrics: Accuracy, F1-score, AUC-ROC")
    print("✓ Analyze per-class performance and confusion matrices")
    
    # Combined training
    if os.path.exists('../data/breakhis'):
        print("\n✓ Consider combining with BreakHis dataset for improved generalization")
        print("✓ Use domain adaptation techniques for multi-dataset training")
    
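    # Give the remainder to the test set so the three counts always sum to the total
    n_train = int(total_images * 0.7)
    n_val = int(total_images * 0.15)
    n_test = total_images - n_train - n_val
    print("\nSuggested Train/Val/Test Split:")
    print(f"- Training: {n_train} images (70%)")
    print(f"- Validation: {n_val} images (15%)")
    print(f"- Testing: {n_test} images (15%)")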

## Summary

This EDA covers the following aspects of the BACH dataset:

1. **Dataset Overview**: Class distribution and basic statistics
2. **Image Properties**: Dimensions, file sizes, and formats
3. **Visual Analysis**: Sample images from each class
4. **Color Analysis**: RGB statistics and brightness/contrast patterns
5. **Texture Analysis**: Edge strength and intensity variations
6. **Data Quality**: Assessment of missing or corrupted files
7. **Comparison**: Side-by-side analysis with BreakHis dataset
8. **Recommendations**: Specific suggestions for model training

With four classes of 100 images each at 2048 x 1536 pixels, BACH is small but balanced, and its high-resolution images complement the BreakHis dataset for breast cancer detection research.