## Quick Test - Data Loading Fix

**This cell tests the updated data loading logic to ensure files are found correctly.**

In [None]:
# Quick test to verify data loading fix
import os
import glob

print("🧪 Testing data loading fix...")
print("=" * 40)

# Test merged_data directory
merged_dir = "merged_data"
if os.path.exists(merged_dir):
    print("✅ Found merged_data directory")
    
    # Check subdirectories
    subdirs = [d for d in os.listdir(merged_dir) if os.path.isdir(os.path.join(merged_dir, d))]
    print(f"   Subdirectories: {subdirs}")
    
    # Count files in each subdirectory
    total_files = 0
    for subdir in subdirs:
        subdir_path = os.path.join(merged_dir, subdir)
        png_files = glob.glob(os.path.join(subdir_path, '*.png'))
        jpg_files = glob.glob(os.path.join(subdir_path, '*.jpg'))
        file_count = len(png_files) + len(jpg_files)
        total_files += file_count
        print(f"   {subdir}: {file_count} files ({len(png_files)} PNG, {len(jpg_files)} JPG)")
    
    print(f"   Total files: {total_files}")
    
    if total_files > 0:
        print("✅ Data loading should work!")
    else:
        print("❌ No files found")
else:
    print("❌ merged_data directory not found")
    print(f"Available directories: {[d for d in os.listdir('.') if os.path.isdir(d)]}")

print("\n🚀 Proceeding with updated notebook...")

# Brain Tumor Classification - Custom CNN Modeling and Evaluation

## Business Objectives

### **Primary Objective**:
> **Automate tumor detection** in MRI scans using a custom-built convolutional neural network (CNN), trained from scratch on balanced authentic data.

### **Secondary Objective**:
> **Enable visual interpretability** to help differentiate between tumor and non-tumor MRI scans using model predictions, confidence scores, and evaluation metrics for dashboard integration.

## Technical Objectives

* ✅ **Custom CNN Architecture**: Build a CNN from scratch optimized for medical image classification (no pre-trained models)
* ✅ **Binary Classification**: Train model to distinguish between tumor vs. no-tumor MRI scans
* ✅ **Balanced Authentic Data**: Utilize balanced sampling from DataCollection (no augmentation)
* ✅ **Performance Optimization**: Achieve >90% accuracy, >88% recall, <1.5 sec/inference time
* ✅ **Threshold Optimization**: Find optimal classification threshold through precision-recall curve analysis
* ✅ **Model Evaluation**: Comprehensive analysis using accuracy, precision, recall, F1-score, and confusion matrix
* ✅ **Confidence Analysis**: Generate prediction confidence scores for model interpretability
* ✅ **Dashboard Integration**: Create evaluation artifacts for Streamlit dashboard consumption
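The threshold-optimization objective can be illustrated without any ML library. The sketch below is one assumed approach, not necessarily the one used later in this notebook: it scans every distinct score as a candidate threshold and keeps the one with the highest F1. The helper names `precision_recall_at` and `best_f1_threshold` and the toy scores are hypothetical; the notebook itself imports scikit-learn's `precision_recall_curve` for the real analysis.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Return (precision, recall) when scores >= threshold count as tumor (1)."""
    tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, y_score) if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_f1_threshold(y_true, y_score):
    """Scan every distinct score as a candidate threshold; return (threshold, f1)."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(y_score)):
        p, r = precision_recall_at(y_true, y_score, t)
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy data (hypothetical): 1 = tumor, 0 = no tumor
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.65, 0.55, 0.90, 0.20]

threshold, f1 = best_f1_threshold(y_true, y_score)
print(f"Best threshold: {threshold:.2f}, F1: {f1:.3f}")
```

Maximizing F1 is only one possible criterion. Because recall matters most in a screening setting, a lower threshold that trades some precision for recall may be preferable.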

## Inputs

* ✅ **Training Data**: Balanced authentic MRI brain tumor images from DataCollection notebook
  - Train/validation/test splits with verified no data leakage
  - Binary classification: tumor vs no-tumor with balanced class distribution via intelligent sampling
  - Image preprocessing: 224x224 RGB, single normalization to [0,1] range
  - **No augmentation** - maintains authentic MRI data quality
* ✅ **Model Requirements**: Custom CNN architecture specifications
  - Progressive filter sizes: 16 → 32 → 64
  - Compact design for real-time inference (<1.5 sec/sample)
  - Binary output with sigmoid activation
  - Optimized for balanced authentic data (reduced dropout)
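The size of this architecture can be checked by hand. The sketch below traces the spatial size and trainable parameter count under the assumptions used in the model cell of this notebook: 3x3 kernels with `same` padding, a 2x2 max pool after each convolution, global average pooling, and one sigmoid output unit. The helper names `conv2d_params` and `trace_architecture` are hypothetical.

```python
def conv2d_params(kernel, in_ch, out_ch):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_ch + 1) * out_ch


def trace_architecture(size=224, in_ch=3, filters=(16, 32, 64), kernel=3):
    """Return (final_spatial_size, total_trainable_params)."""
    total = 0
    for out_ch in filters:
        total += conv2d_params(kernel, in_ch, out_ch)  # 'same' padding keeps size
        size //= 2                                     # 2x2 max pooling halves it
        in_ch = out_ch
    total += in_ch + 1  # Dense(1) after global average pooling: one weight per channel + bias
    return size, total


final_size, n_params = trace_architecture()
print(f"Feature map before global pooling: {final_size}x{final_size}x64")
print(f"Trainable parameters: {n_params:,}")
```

Under these assumptions the network stays well below 100k parameters, which is consistent with the compact, low-latency design goal.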

## Expected Outputs

* 🎯 **Custom CNN Model**:
  - Architecture: 3 convolutional blocks, global average pooling, and a sigmoid output layer
  - Training on balanced authentic MRI data
  - Saved as: `best_brain_tumor_model.keras`

* 🎯 **Performance Metrics**:
  - **Target Accuracy**: >90% (higher expectation for authentic data)
  - **Target Recall**: >88% (critical for medical use)
  - **Inference Time**: <1.5 sec/sample
  - **Data Quality**: Authentic MRI (no augmentation artifacts)

* 🎯 **Evaluation Artifacts**:
  - `test_predictions.csv`: Individual predictions with confidence scores
  - `evaluation_metrics.json`: Comprehensive performance metrics
  - `confusion_matrices.json`: Confusion matrix data for both thresholds
  - `training_history.json`: Training progression metrics

* 🎯 **Model Interpretability**:
  - Confidence score distribution analysis
  - Precision-recall curve with optimal threshold identification
  - Performance comparison: default vs optimal thresholds
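As a rough illustration of how the evaluation artifacts could be produced, the sketch below writes `test_predictions.csv` and `evaluation_metrics.json` with the standard library only. The function `save_artifacts`, the CSV column names, and the demo values are assumptions made for this example; the schema the dashboard actually expects is not specified here.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_artifacts(out_dir, file_paths, y_true, y_prob, metrics, threshold=0.5):
    """Write test_predictions.csv and evaluation_metrics.json into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # One row per test image; column names are illustrative assumptions.
    with open(out_dir / "test_predictions.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "true_label", "tumor_probability", "predicted_label"])
        for path, label, prob in zip(file_paths, y_true, y_prob):
            writer.writerow([path, label, f"{prob:.4f}", int(prob > threshold)])

    # Store plain floats so the file loads without NumPy.
    with open(out_dir / "evaluation_metrics.json", "w") as f:
        json.dump({k: float(v) for k, v in metrics.items()}, f, indent=2)

    return out_dir


# Hypothetical values purely for demonstration
demo_dir = save_artifacts(
    tempfile.mkdtemp(),
    file_paths=["a.png", "b.png"],
    y_true=[0, 1],
    y_prob=[0.12, 0.91],
    metrics={"accuracy": 1.0, "recall": 1.0},
)
print(sorted(p.name for p in demo_dir.iterdir()))
```

Casting metric values to `float` matters in practice because NumPy scalars are not JSON serializable by default.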

## Data Quality Advantages

* **Authentic MRI Quality**: No augmentation artifacts, preserving medical image integrity
* **Balanced Sampling**: Intelligent class balancing from DataCollection maintains data authenticity
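* **Generalization**: Training only on authentic scans keeps augmentation artifacts out of the learned features; because no augmentation is used, overfitting is controlled through dropout and early stopping instead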
* **Faster Training**: Balanced data often converges faster than imbalanced datasets
* **Clinical Relevance**: Real MRI characteristics preserved for better clinical applicability

## Success Criteria

| Component | Target | Approach |
|-----------|---------|----------|
| **Accuracy** | >90% | Authentic balanced data |
| **Recall** | >88% | High sensitivity for medical use |
| **Inference Time** | <1.5 sec | Efficient CNN architecture |
| **Data Quality** | Authentic | No augmentation, balanced sampling |
| **Model Type** | Custom CNN | Built from scratch |
| **Dashboard Ready** | Yes | All artifacts generated |
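The inference-time criterion can be checked with a simple wall-clock measurement. The sketch below is a minimal example that times an arbitrary prediction callable per sample; `mean_inference_seconds` and `fake_predict` are hypothetical stand-ins, and a real check would pass a function that calls the trained model on one preprocessed image.

```python
import time


def mean_inference_seconds(predict_fn, samples, warmup=1):
    """Average wall-clock seconds per sample for predict_fn(sample)."""
    for sample in samples[:warmup]:
        predict_fn(sample)  # warm-up calls are excluded from timing
    start = time.perf_counter()
    for sample in samples:
        predict_fn(sample)
    return (time.perf_counter() - start) / len(samples)


def fake_predict(sample):
    """Stand-in for a model call; returns a dummy probability."""
    return sum(sample) / len(sample)


samples = [[0.1, 0.5, 0.9]] * 100
per_sample = mean_inference_seconds(fake_predict, samples)
print(f"{per_sample:.6f} s/sample, target met: {per_sample < 1.5}")
```

The warm-up call is there because the first call to a deep learning model often includes one-time setup cost that would distort the average.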

To run evaluation on an already trained model, execute the following cells in order:

1. **Cell 8** (07dd3490) - Import libraries (glob, numpy, tf, etc.)
2. **Cell 7** (742e718d) - Define data directories
3. **Cell 9** (60b9fbb5) - Set `IMG_SIZE` and `BATCH_SIZE`
4. **Cell 11** (b90b9f33) - Extract file paths and labels
5. **Cell 15** (f0d9ec00) - Define preprocessing function
6. **Cell 17** (075210b6) - Create `test_ds` ⭐
7. **Cell 35** (de80e652) - Test Set Evaluation ✅

---

## 1. Change Working Directory

In [None]:
import os

# Check current directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

# Change to project root directory
os.chdir('/workspaces/brain-tumor-classification')
print(f"Working directory changed to: {os.getcwd()}")

## 2. Import Core Libraries

In [None]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import random
import warnings
import glob
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import precision_recall_curve, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# Set random seeds for reproducibility
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)
random.seed(SEED)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")
print(f"Numpy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print("✅ All libraries imported successfully!")

## 3. Data Loading & Splitting

In [None]:
# Define data directories - Updated to match DataCollection notebook exactly
print("🔍 Using DataCollection notebook output structure...")
print("=" * 50)

# Use exact paths from DataCollection notebook
train_dir = "inputs/brain_tumor_dataset/train"
val_dir = "inputs/brain_tumor_dataset/validation"  
test_dir = "inputs/brain_tumor_dataset/test"

# Verify the directories exist (DataCollection creates these)
print("📁 Checking DataCollection output directories:")
for split, path in [('train', train_dir), ('validation', val_dir), ('test', test_dir)]:
    if os.path.exists(path):
        subdirs = [d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d))]
        print(f"✅ {split.capitalize()}: {path}")
        print(f"   Classes: {subdirs}")
        
        # Count images in each class
        for class_name in subdirs:
            class_path = os.path.join(path, class_name)
            image_count = len([f for f in os.listdir(class_path) 
                             if f.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.tiff'))])
            print(f"   {class_name}: {image_count} images")
    else:
        print(f"❌ {split.capitalize()}: {path} (NOT FOUND)")
        print("   Run DataCollection notebook first to create balanced splits")

print("\n🎯 DataCollection Output Structure:")
print(f"Training: {train_dir}")
print(f"Validation: {val_dir}")
print(f"Test: {test_dir}")

# Final verification
all_exist = all(os.path.exists(d) for d in [train_dir, val_dir, test_dir])
if all_exist:
    print("✅ All DataCollection directories verified!")
    print("✅ Ready to use DataCollection balanced splits!")
else:
    print("❌ DataCollection splits not found!")
    print("⚠️  Please run the DataCollection notebook first to create balanced splits")
    raise FileNotFoundError("DataCollection output not available - run DataCollection notebook first")

## 4. Data Preparation & Normalization

In [None]:
# Set image size and batch size constants
IMG_SIZE = (224, 224)
BATCH_SIZE = 16

print(f"Image size: {IMG_SIZE}")
print(f"Batch size: {BATCH_SIZE}")
print("✅ Data preparation constants set!")

## 5. Build File Path and Label Lists

In [None]:
def get_file_paths_and_labels(data_dir):
    """Extract file paths and labels from DataCollection directory structure"""
    if not os.path.exists(data_dir):
        print(f"❌ Directory not found: {data_dir}")
        print("   Run DataCollection notebook first")
        return [], [], []
    
    class_names = sorted(os.listdir(data_dir))
    # Filter out non-directories
    class_names = [name for name in class_names if os.path.isdir(os.path.join(data_dir, name))]
    
    file_paths = []
    labels = []
    
    print(f"📁 Processing DataCollection directory: {data_dir}")
    print(f"   Found classes: {class_names}")
    
    for idx, class_name in enumerate(class_names):
        class_dir = os.path.join(data_dir, class_name)
        # DataCollection saves as PNG files
        png_files = glob.glob(os.path.join(class_dir, '*.png'))
        jpg_files = glob.glob(os.path.join(class_dir, '*.jpg'))
        jpeg_files = glob.glob(os.path.join(class_dir, '*.jpeg'))
        files = png_files + jpg_files + jpeg_files
        
        file_paths.extend(files)
        labels.extend([idx] * len(files))
        print(f"   {class_name}: {len(files)} files")
    
    return file_paths, labels, class_names

# Extract file paths and labels from DataCollection output
print("🔍 Extracting file paths and labels from DataCollection output...")
print("=" * 70)

# Use DataCollection's balanced splits
train_files, train_labels, class_names = get_file_paths_and_labels(train_dir)
val_files, val_labels, _ = get_file_paths_and_labels(val_dir)
test_files, test_labels, _ = get_file_paths_and_labels(test_dir)

# Verify DataCollection output exists
if not train_files:
    print("❌ No training files found in DataCollection output!")
    print("⚠️  Run DataCollection notebook first to create balanced dataset")
    print("Available directories:")
    for item in os.listdir('.'):
        if os.path.isdir(item):
            print(f"  - {item}/")
    raise FileNotFoundError("DataCollection output not found - run DataCollection notebook first")

# Analyze DataCollection class balance
unique, counts = np.unique(train_labels, return_counts=True)
class_balance = dict(zip(class_names, counts))

print("\n📊 DataCollection Balanced Dataset Statistics:")
print("=" * 50)
print(f"Classes: {class_names}")
print(f"Class balance in training set: {class_balance}")

# Check DataCollection balance quality
imbalance_ratio = min(counts) / max(counts)
print(f"Balance ratio: {imbalance_ratio:.3f}")

# DataCollection is expected to provide balanced data
if imbalance_ratio > 0.95:
    print("✅ Excellent balance from DataCollection!")
    balance_quality = "EXCELLENT"
elif imbalance_ratio > 0.8:
    print("✅ Good balance from DataCollection")
    balance_quality = "GOOD"
else:
    # Do not report success when the classes are clearly uneven
    print("⚠️  Class imbalance detected - check the DataCollection output")
    balance_quality = "IMBALANCED"

# Class weights are not used because DataCollection handles balancing
class_weights = None
print("✅ No class weights applied (DataCollection handles balancing)")

print(f"\nDataCollection dataset sizes:")
print(f"Train samples: {len(train_files)}")
print(f"Validation samples: {len(val_files)}")
print(f"Test samples: {len(test_files)}")

# Verify DataCollection binary classification
expected_classes = ['notumor', 'tumor']
if set(class_names) == set(expected_classes):
    print("✅ DataCollection binary classification confirmed: notumor vs tumor")
elif len(class_names) == 2:
    print(f"✅ DataCollection binary classification: {class_names}")
else:
    print(f"⚠️  Found {len(class_names)} classes: {class_names}")
    print("   DataCollection typically provides binary classification")

print("\n🎉 DataCollection output loaded successfully!")
print(f"📊 Balance Quality: {balance_quality}")
print("🔬 Data Source: DataCollection balanced sampling")

## 6. Preprocessing Function and tf.data Datasets

In [None]:
def preprocess_image(image_path, target_size=(224, 224)):
    """Load one DataCollection image, resize it, and scale pixels to [0, 1]."""
    # This function runs in graph mode inside Dataset.map, so a Python
    # try/except here cannot catch per-file decode errors at runtime.
    image = tf.io.read_file(image_path)
    # decode_image detects PNG/JPEG from the file contents, so decoding does
    # not depend on the file extension or its letter case.
    image = tf.io.decode_image(image, channels=3, expand_animations=False)
    image.set_shape([None, None, 3])

    # Resize to target size (tf.image.resize returns float32)
    image = tf.image.resize(image, target_size)
    # Normalize pixel values to [0, 1]
    image = tf.cast(image, tf.float32) / 255.0

    return image

def create_dataset(file_paths, labels, batch_size=32, shuffle=True, augment=False):
    """Create TensorFlow Dataset from DataCollection files"""
    print(f"📁 Creating dataset from {len(file_paths)} DataCollection files...")
    
    # Verify DataCollection files exist
    missing_files = []
    for fp in file_paths[:5]:  # Check first 5 files
        if not os.path.exists(fp):
            missing_files.append(fp)
    
    if missing_files:
        print(f"❌ Missing DataCollection files: {missing_files}")
        print("   Run DataCollection notebook to create balanced dataset")
        raise FileNotFoundError("DataCollection output incomplete")
    
    # Create dataset
    dataset = tf.data.Dataset.from_tensor_slices((file_paths, labels))
    
    # Map preprocessing function
    dataset = dataset.map(
        lambda x, y: (preprocess_image(x), y),
        num_parallel_calls=tf.data.AUTOTUNE
    )
    
    # NO AUGMENTATION for authentic MRI data
    if augment:
        print("⚠️  WARNING: Augmentation disabled for authentic MRI data")
        print("   Medical images maintain diagnostic quality without artificial distortions")
    
    # Shuffle if specified
    if shuffle:
        dataset = dataset.shuffle(buffer_size=min(1000, len(file_paths)))
    
    # Batch and prefetch
    dataset = dataset.batch(batch_size)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    
    print(f"✅ Dataset created with batch size {batch_size} (NO augmentation)")
    return dataset

# Create TensorFlow datasets from DataCollection output
print("🚀 Creating TensorFlow datasets from DataCollection output...")
print("=" * 70)

# Create datasets using DataCollection balanced splits - NO AUGMENTATION
train_ds = create_dataset(
    train_files, train_labels, 
    batch_size=BATCH_SIZE, 
    shuffle=True, 
    augment=False  # FIXED: No augmentation for authentic MRI data
)

val_ds = create_dataset(
    val_files, val_labels,
    batch_size=BATCH_SIZE,
    shuffle=False,
    augment=False  # No augmentation for validation
)

test_ds = create_dataset(
    test_files, test_labels,
    batch_size=BATCH_SIZE,
    shuffle=False,
    augment=False  # No augmentation for test
)

print("\n📊 DataCollection Dataset Configuration:")
print(f"Input shape: {IMG_SIZE + (3,)}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Number of classes: {len(class_names)}")
print(f"Class names: {class_names}")
print(f"Augmentation: DISABLED (authentic MRI data preserved)")

# Verify dataset shapes
print("\n🔍 Dataset Verification:")
for x_batch, y_batch in train_ds.take(1):
    print(f"Training batch shape: {x_batch.shape}, {y_batch.shape}")
    print(f"Image dtype: {x_batch.dtype}")
    print(f"Label dtype: {y_batch.dtype}")
    print(f"Image value range: [{tf.reduce_min(x_batch):.3f}, {tf.reduce_max(x_batch):.3f}]")
    break

print("\n✅ DataCollection datasets ready for training!")
print("🎯 Training strategy: Balanced authentic data (NO augmentation)")
print("🏥 Medical image quality: Preserved for diagnostic accuracy")

## 7. Build the CNN Model

In [None]:
def create_cnn_model(input_shape=(224, 224, 3), num_classes=2):
    """
    Create CNN model for DataCollection binary classification
    
    Architecture optimized for DataCollection balanced dataset:
    - TRUE binary classification: single output with sigmoid
    - 224x224 input images
    - Progressive filter scaling: 16 → 32 → 64
    - Global Average Pooling to reduce overfitting
    - Dropout for regularization
    """
    
    model = tf.keras.Sequential([
        # Input layer
        tf.keras.layers.Input(shape=input_shape),
        
        # First convolutional block
        tf.keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        
        # Second convolutional block
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        
        # Third convolutional block
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same'),
        tf.keras.layers.MaxPooling2D((2, 2)),
        
        # Global Average Pooling instead of Flatten
        tf.keras.layers.GlobalAveragePooling2D(),
        
        # Dropout for regularization
        tf.keras.layers.Dropout(0.2),
        
        # Output layer for TRUE binary classification
        tf.keras.layers.Dense(1, activation='sigmoid')  # FIXED: Single output with sigmoid
    ])
    
    return model

# Create model for DataCollection balanced dataset
print("🏗️  Creating CNN model for DataCollection binary classification...")
print("=" * 70)

model = create_cnn_model(
    input_shape=IMG_SIZE + (3,),
    num_classes=2  # Still pass 2 for documentation, but model uses 1 output
)

# Compile model with BINARY classification settings
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',  # FIXED: Binary crossentropy for sigmoid output
    metrics=['accuracy']
)

# Display model architecture
print("\n📋 Model Architecture for DataCollection Dataset:")
print("=" * 50)
model.summary()

# Calculate model parameters
total_params = model.count_params()
print("\n📊 Model Statistics:")
print(f"Total parameters: {total_params:,}")
print(f"Input shape: {IMG_SIZE + (3,)}")
print(f"Output: Single neuron with sigmoid (binary classification)")

# Model optimization for DataCollection
print("\n🎯 Model Optimization:")
print("✅ Architecture: Lightweight CNN (16→32→64 filters)")
print("✅ Regularization: Dropout (0.2) + Global Average Pooling")
print("✅ Optimizer: Adam (adaptive learning rate)")
print("✅ Loss: Binary crossentropy (TRUE binary classification)")
print("✅ Target: DataCollection balanced binary classification")

# Verify model is ready for DataCollection data
print("\n🔍 DataCollection Compatibility Check:")
print(f"✅ Input shape matches DataCollection preprocessing: {IMG_SIZE + (3,)}")
print("✅ Output configured for binary classification: 1 neuron + sigmoid")
print("✅ Model ready for DataCollection balanced training!")

print("\n🎉 Model created successfully for DataCollection dataset!")
print("🚀 Ready to train on balanced tumor detection data")

## 8. Training Callbacks and Configuration

In [None]:
# Training callbacks for DataCollection model
print("⚙️  Setting up training callbacks for DataCollection model...")
print("=" * 60)

# Create callbacks
callbacks = [
    # Early stopping to prevent overfitting on DataCollection data
    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True,
        verbose=1
    ),
    
    # Learning rate reduction for better convergence
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-7,
        verbose=1
    ),
    
    # Model checkpointing to save best model
    tf.keras.callbacks.ModelCheckpoint(
        filepath='best_brain_tumor_model.keras',
        monitor='val_accuracy',
        save_best_only=True,
        verbose=1
    )
]

print("📋 Callback Configuration:")
print("✅ Early Stopping: Monitor val_loss, patience=10")
print("✅ Learning Rate Reduction: Factor=0.5, patience=5")
print("✅ Model Checkpoint: Save best model based on val_accuracy")
print("✅ Optimized for DataCollection balanced training")

# Training configuration
EPOCHS = 50
print("\n🎯 Training Configuration:")
print(f"Max epochs: {EPOCHS}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Dataset: DataCollection balanced splits")
print(f"Training samples: {len(train_files)}")
print(f"Validation samples: {len(val_files)}")

# Calculate steps per epoch (ceiling division: the final partial batch still counts)
train_steps = -(-len(train_files) // BATCH_SIZE)
val_steps = -(-len(val_files) // BATCH_SIZE)

print(f"Training steps per epoch: {train_steps}")
print(f"Validation steps per epoch: {val_steps}")

print("\n🚀 Ready to train on DataCollection balanced dataset!")

## 9. Model Training

Train the CNN model on DataCollection balanced dataset.
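Training length is governed by the early-stopping callback configured for this model (monitoring `val_loss` with a patience value and restoring the best weights). The sketch below mimics that stopping rule in plain Python to show which epoch ends training and which epoch supplies the restored weights. It is a simplification that ignores options such as `min_delta`; the function `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far; otherwise it runs through every epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses; patience shortened to 3 for readability
losses = [0.69, 0.55, 0.48, 0.50, 0.49, 0.51, 0.47]
stop, best = early_stopping_epoch(losses, patience=3)
print(f"Stopped after epoch {stop}; best weights come from epoch {best}")
```

In this toy run the seventh epoch would have improved on the best loss, which shows the trade-off: a short patience saves time but can stop before a late improvement.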

In [None]:
print("🚀 Starting model training with DataCollection balanced data...")
print("=" * 60)
print("📊 Training Configuration:")
print("   - DataCollection balanced sampling: 1,400 samples per class")
print("   - Authentic MRI data (no augmentation)")
print("   - Clean train/validation/test splits")
print("   - Perfect class balance (50/50)")
print("   - Optimized callbacks for balanced data")
print("   - Expected faster convergence")
print()
print("⏳ Training in progress...")

# Full training with DataCollection balanced data
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=35,  # Reduced epochs for balanced data (faster convergence expected)
    callbacks=callbacks,
    verbose=1
)

print("\n🎉 Training completed!")
print("📊 Training with DataCollection balanced authentic data")
print("✅ 1,400 samples per class (perfectly balanced)")
print("🔍 Let's analyze the training results...")

---

## 10. Training History Visualization

Let's visualize the training process to understand model performance over time.
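Before plotting, it can help to reduce the history to a few headline numbers. The sketch below assumes a dictionary shaped like Keras' `History.history` (lists keyed by `accuracy`, `val_accuracy`, `loss`, `val_loss`) and picks the epoch with the highest validation accuracy. The function `summarize_history` and the demo values are hypothetical.

```python
def summarize_history(history):
    """Pick the epoch with the highest validation accuracy from a history dict."""
    val_acc = history["val_accuracy"]
    best_idx = max(range(len(val_acc)), key=val_acc.__getitem__)
    return {
        "best_epoch": best_idx + 1,  # report 1-based epochs
        "best_val_accuracy": val_acc[best_idx],
        "val_loss_at_best": history["val_loss"][best_idx],
        # A large positive gap suggests overfitting
        "train_val_gap": history["accuracy"][best_idx] - val_acc[best_idx],
    }


# Hypothetical numbers shaped like Keras' History.history
demo_history = {
    "accuracy": [0.70, 0.82, 0.90, 0.95],
    "val_accuracy": [0.68, 0.80, 0.86, 0.85],
    "loss": [0.60, 0.42, 0.28, 0.15],
    "val_loss": [0.62, 0.45, 0.36, 0.40],
}
summary = summarize_history(demo_history)
print(summary)
```

In the demo data, training accuracy keeps rising after the best epoch while validation accuracy drops, which is the pattern to look for in the plots.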

In [None]:
# Visualize DataCollection training history
print("📊 Visualizing DataCollection training history...")
print("=" * 60)

# Create training history plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Plot accuracy
ax1.plot(history.history['accuracy'], label='Training Accuracy', color='blue', linewidth=2)
ax1.plot(history.history['val_accuracy'], label='Validation Accuracy', color='orange', linewidth=2)
ax1.set_title('DataCollection Model Accuracy', fontsize=14, fontweight='bold')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Accuracy')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Mark best epoch
best_epoch = np.argmax(history.history['val_accuracy'])
ax1.axvline(x=best_epoch, color='red', linestyle='--', alpha=0.7, label=f'Best Epoch ({best_epoch + 1})')
ax1.legend()

# Plot loss
ax2.plot(history.history['loss'], label='Training Loss', color='blue', linewidth=2)
ax2.plot(history.history['val_loss'], label='Validation Loss', color='orange', linewidth=2)
ax2.set_title('DataCollection Model Loss', fontsize=14, fontweight='bold')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Loss')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Mark best epoch
ax2.axvline(x=best_epoch, color='red', linestyle='--', alpha=0.7, label=f'Best Epoch ({best_epoch + 1})')
ax2.legend()

plt.tight_layout()
plt.suptitle('DataCollection Binary Classification Training Results', fontsize=16, fontweight='bold', y=1.02)
plt.show()

# Training metrics summary
print("\n📈 DataCollection Training Metrics Summary:")
print("=" * 50)
print(f"Dataset: DataCollection balanced binary classification")
print(f"Classes: {class_names}")
print(f"Training samples: {len(train_files)}")
print(f"Validation samples: {len(val_files)}")

print(f"\nFinal Metrics:")
print(f"Training Accuracy: {history.history['accuracy'][-1]:.4f}")
print(f"Validation Accuracy: {history.history['val_accuracy'][-1]:.4f}")
print(f"Training Loss: {history.history['loss'][-1]:.4f}")
print(f"Validation Loss: {history.history['val_loss'][-1]:.4f}")

print(f"\nBest Validation Performance:")
print(f"Best Epoch: {best_epoch + 1}")
print(f"Best Validation Accuracy: {max(history.history['val_accuracy']):.4f}")
print(f"Best Validation Loss: {history.history['val_loss'][best_epoch]:.4f}")

# Check convergence
recent_val_loss = np.mean(history.history['val_loss'][-5:])
print(f"\nConvergence Analysis:")
print(f"Recent validation loss (last 5 epochs): {recent_val_loss:.4f}")

# history.params['epochs'] is the epoch count actually passed to model.fit
if len(history.history['loss']) < history.params['epochs']:
    print("✅ Early stopping triggered - model converged")
else:
    print("⚠️  Training completed full epochs")

print("\n🎯 DataCollection model training visualization complete!")
print(f"📊 Model shows {'good' if max(history.history['val_accuracy']) > 0.8 else 'moderate'} performance on balanced dataset")

## 11. Model Evaluation

Evaluate the trained model on the test dataset to assess its performance.
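A model with a single sigmoid output returns one probability per sample, so class labels come from thresholding that probability rather than from an argmax across columns. The sketch below contrasts the two in plain Python. The helper names and probabilities are hypothetical, and the label order (0 = `notumor`, 1 = `tumor`) assumes the alphabetically sorted class directories used in this notebook.

```python
def sigmoid_to_labels(probabilities, threshold=0.5):
    """Map tumor probabilities to class indices (0 = notumor, 1 = tumor)."""
    return [int(p > threshold) for p in probabilities]


def argmax_rows(rows):
    """Index of the largest entry in each row, like np.argmax(..., axis=1)."""
    return [max(range(len(r)), key=r.__getitem__) for r in rows]


# A sigmoid head emits one column per sample (hypothetical values)
predictions = [[0.03], [0.97], [0.61], [0.42]]

labels = sigmoid_to_labels([row[0] for row in predictions])
print(labels)                    # thresholded labels
print(argmax_rows(predictions))  # argmax over a single column is always 0
```

If every test prediction comes out as class 0, an argmax applied to a one-column output is a likely cause.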

In [None]:
# Evaluate model on DataCollection test dataset
print("🔍 Evaluating model on DataCollection test dataset...")
print("=" * 60)

# Load best model
try:
    model = tf.keras.models.load_model('best_brain_tumor_model.keras')
    print("✅ Best model loaded successfully")
except Exception as e:
    print(f"⚠️  Using current model (could not load best model: {e})")

# Evaluate on test dataset
test_loss, test_accuracy = model.evaluate(test_ds, verbose=1)

print("\n📊 DataCollection Test Results:")
print("=" * 40)
print(f"Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print(f"Test Loss: {test_loss:.4f}")

# Get predictions for detailed analysis
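print("\n🔍 Generating predictions on DataCollection test set...")
predictions = model.predict(test_ds)
# The model has a single sigmoid output, so threshold the probability;
# np.argmax over one column would always return class 0.
predicted_classes = (predictions.flatten() > 0.5).astype(int)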

# Get true labels
true_labels = []
for _, labels in test_ds:
    true_labels.extend(labels.numpy())
true_labels = np.array(true_labels)

# Classification report
from sklearn.metrics import classification_report, confusion_matrix
print("\n📋 Detailed Classification Report:")
print("=" * 50)
print(classification_report(true_labels, predicted_classes, target_names=class_names))

# Confusion matrix
cm = confusion_matrix(true_labels, predicted_classes)
print(f"\nConfusion Matrix:")
print(f"{'':>12} {'Predicted':>20}")
print(f"{'Actual':>8} {class_names[0]:>10} {class_names[1]:>10}")
for i, class_name in enumerate(class_names):
    print(f"{class_name:>8} {cm[i][0]:>10} {cm[i][1]:>10}")

# Calculate per-class metrics
print("\n📈 Per-Class Performance:")
print("=" * 40)
for i, class_name in enumerate(class_names):
    class_mask = true_labels == i
    class_accuracy = np.mean(predicted_classes[class_mask] == true_labels[class_mask])
    class_count = np.sum(class_mask)
    print(f"{class_name:>10}: {class_accuracy:.4f} ({class_accuracy*100:.2f}%) - {class_count} samples")

# Performance summary
print("\n🎯 DataCollection Test Performance Summary:")
print("=" * 50)
print(f"Dataset: DataCollection balanced test set")
print(f"Test samples: {len(test_files)}")
print(f"Classes: {class_names}")
print(f"Overall accuracy: {test_accuracy:.4f}")

# Performance assessment
if test_accuracy > 0.9:
    performance_level = "Excellent"
    emoji = "🎉"
elif test_accuracy > 0.8:
    performance_level = "Good"
    emoji = "✅"
elif test_accuracy > 0.7:
    performance_level = "Moderate"
    emoji = "⚠️"
else:
    performance_level = "Needs Improvement"
    emoji = "❌"

print(f"\nPerformance Level: {emoji} {performance_level}")
print(f"Model Quality: {'Production Ready' if test_accuracy > 0.85 else 'Needs Improvement'}")

print("\n🎉 DataCollection model evaluation complete!")
print(f"📊 Model achieves {test_accuracy*100:.2f}% accuracy on balanced test data")

## 12. Confusion Matrix Analysis

Analyze the confusion matrix to understand model performance by class.
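For a binary problem, every headline metric follows directly from the four cells of the confusion matrix. The sketch below works through that arithmetic for the positive (tumor) class, assuming the scikit-learn layout where rows are true classes and columns are predictions, with `notumor` first. The function `binary_metrics` and the counts are hypothetical.

```python
def binary_metrics(cm):
    """Metrics for the positive class from cm = [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: rows are true classes, columns are predictions
cm = [[90, 10],   # notumor: 90 correct, 10 flagged as tumor
      [5, 95]]    # tumor:   5 missed,   95 detected
for name, value in binary_metrics(cm).items():
    print(f"{name}: {value:.4f}")
```

Here the five missed tumors (false negatives) lower recall, which is the number tied to the >88% recall target.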

In [None]:
# Visualize DataCollection confusion matrix
print("📊 Creating confusion matrix visualization for DataCollection test results...")
print("=" * 70)

# Calculate confusion matrix
cm = confusion_matrix(true_labels, predicted_classes)

# Create visualization
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_names, yticklabels=class_names,
            cbar_kws={'label': 'Count'})

plt.title('DataCollection Test Set - Confusion Matrix', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)

# Add performance annotations
total_samples = np.sum(cm)
accuracy = np.trace(cm) / total_samples

plt.figtext(0.02, 0.02, f'DataCollection Test Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)', 
            fontsize=10, ha='left')
plt.figtext(0.02, 0.05, f'Total Samples: {total_samples}', fontsize=10, ha='left')
plt.figtext(0.02, 0.08, f'Classes: {", ".join(class_names)}', fontsize=10, ha='left')

plt.tight_layout()
plt.show()

# Detailed confusion matrix analysis
print("\n📋 DataCollection Confusion Matrix Analysis:")
print("=" * 50)

# Calculate metrics for each class
for i, class_name in enumerate(class_names):
    tp = cm[i, i]  # True Positives
    fp = cm[:, i].sum() - tp  # False Positives
    fn = cm[i, :].sum() - tp  # False Negatives
    tn = cm.sum() - tp - fp - fn  # True Negatives
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    print(f"\n{class_name.upper()}:")
    print(f"  True Positives: {tp}")
    print(f"  False Positives: {fp}")
    print(f"  False Negatives: {fn}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1-Score: {f1:.4f}")

# Overall performance
print("\n🎯 DataCollection Overall Performance:")
print("=" * 40)
print(f"Total Test Samples: {total_samples}")
print(f"Correct Predictions: {np.trace(cm)}")
print(f"Incorrect Predictions: {total_samples - np.trace(cm)}")
print(f"Overall Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")

# Balance analysis
print("\n⚖️  DataCollection Balance Analysis:")
print("=" * 40)
for i, class_name in enumerate(class_names):
    class_total = cm[i, :].sum()
    class_percentage = (class_total / total_samples) * 100
    print(f"{class_name}: {class_total} samples ({class_percentage:.1f}%)")

# Error analysis
print("\n🔍 Error Analysis:")
print("=" * 30)
total_errors = total_samples - np.trace(cm)
error_rate = total_errors / total_samples

print(f"Total Errors: {total_errors}")
print(f"Error Rate: {error_rate:.4f} ({error_rate*100:.2f}%)")

# Most common errors
if len(class_names) == 2:
    false_positives = cm[0, 1]  # notumor predicted as tumor
    false_negatives = cm[1, 0]  # tumor predicted as notumor
    
    print(f"\nBinary Classification Errors:")
    print(f"  False Positives ({class_names[0]}→{class_names[1]}): {false_positives}")
    print(f"  False Negatives ({class_names[1]}→{class_names[0]}): {false_negatives}")
    
    if false_positives > false_negatives:
        print("  ⚠️  More false positives (over-detection)")
    elif false_negatives > false_positives:
        print("  ⚠️  More false negatives (under-detection)")
    else:
        print("  ✅ Balanced error distribution")

print(f"\nüéâ DataCollection confusion matrix analysis complete!")
print(f"üìä Model performance: {performance_level} on balanced test data")

In [None]:
# Visualize DataCollection predictions
print("🔍 Visualizing predictions on DataCollection test samples...")
print("=" * 60)

# Get one batch of test images and labels. Batch-level variables carry a
# `batch_` prefix so they do not overwrite same-named variables used by
# other cells (e.g. `predicted_classes`, `test_labels`).
batch_images, batch_labels = next(iter(test_ds.take(1)))
batch_labels = batch_labels.numpy().astype(int).flatten()

# Make predictions (sigmoid output for binary classification)
batch_probs = model.predict(batch_images).flatten()
batch_predicted_classes = (batch_probs > 0.5).astype(int)

# Create visualization
fig, axes = plt.subplots(3, 4, figsize=(16, 12))
axes = axes.ravel()

# The batch is shown in order (not randomly sampled) and may hold fewer than 12 images
n_show = min(12, len(batch_labels))
print(f"Showing the first {n_show} DataCollection test samples with predictions...")

for i in range(n_show):
    # Display image
    axes[i].imshow(batch_images[i].numpy())

    # Sigmoid probability and its distance from the 0.5 decision boundary, scaled to [0, 1]
    tumor_prob = batch_probs[i]
    confidence = abs(tumor_prob - 0.5) * 2

    # Color code the title: green for correct, red for incorrect.
    # (An axes background color is not drawn once the axis is turned off.)
    is_correct = batch_labels[i] == batch_predicted_classes[i]
    axes[i].set_title(f'True: {class_names[batch_labels[i]]}\n'
                      f'Pred: {class_names[batch_predicted_classes[i]]}\n'
                      f'Tumor Prob: {tumor_prob:.3f}\n'
                      f'Confidence: {confidence:.3f}',
                      fontsize=10, color='green' if is_correct else 'red')
    axes[i].axis('off')

# Hide any unused panels
for ax in axes[n_show:]:
    ax.axis('off')

plt.suptitle('DataCollection Test Predictions\n(Green title = correct, red title = incorrect)',
             fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Prediction statistics
correct_predictions = np.sum(batch_labels == batch_predicted_classes)
batch_accuracy = correct_predictions / len(batch_labels)

print(f"\n📊 Batch Prediction Statistics:")
print("=" * 40)
print(f"Batch size: {len(batch_labels)}")
print(f"Correct predictions: {correct_predictions}")
print(f"Incorrect predictions: {len(batch_labels) - correct_predictions}")
print(f"Batch accuracy: {batch_accuracy:.4f} ({batch_accuracy*100:.2f}%)")

# Confidence analysis for binary classification
print(f"\n🎯 Confidence Analysis:")
print("=" * 30)
confidence_scores = np.abs(batch_probs - 0.5) * 2  # Distance from decision boundary [0,1]

avg_confidence = np.mean(confidence_scores)
min_confidence = np.min(confidence_scores)
max_confidence = np.max(confidence_scores)

print(f"Average confidence: {avg_confidence:.4f}")
print(f"Min confidence: {min_confidence:.4f}")
print(f"Max confidence: {max_confidence:.4f}")

# High/low confidence analysis
high_confidence_mask = confidence_scores > 0.8
low_confidence_mask = confidence_scores < 0.4

print(f"\nHigh confidence (>0.8): {np.sum(high_confidence_mask)} samples")
print(f"Low confidence (<0.4): {np.sum(low_confidence_mask)} samples")

# Check accuracy by confidence level
if np.sum(high_confidence_mask) > 0:
    high_conf_accuracy = np.mean(batch_labels[high_confidence_mask] == batch_predicted_classes[high_confidence_mask])
    print(f"High confidence accuracy: {high_conf_accuracy:.4f}")

if np.sum(low_confidence_mask) > 0:
    low_conf_accuracy = np.mean(batch_labels[low_confidence_mask] == batch_predicted_classes[low_confidence_mask])
    print(f"Low confidence accuracy: {low_conf_accuracy:.4f}")

print(f"\n🎉 DataCollection prediction visualization complete!")
print(f"📊 Model shows {'high' if avg_confidence > 0.6 else 'moderate'} confidence on test samples")
print(f"🔍 Tumor probability range: [{np.min(batch_probs):.3f}, {np.max(batch_probs):.3f}]")

## 14. Per-Class Accuracy Analysis

Analyze model performance for each class individually.
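Per-class accuracy in this analysis is the fraction of samples of a given true class that were predicted correctly, which is the same quantity as that class's recall. The sketch below uses made-up label lists (not the notebook's `true_labels` and `predicted_classes`) and a hypothetical helper, `per_class_accuracy`, to show how the unweighted (macro) mean and the sample-weighted mean of those values differ when class sizes are unequal.

```python
def per_class_accuracy(y_true, y_pred, n_classes):
    """Return (accuracies, counts): fraction correct and sample count per true class."""
    accuracies, counts = [], []
    for c in range(n_classes):
        idx = [i for i, t in enumerate(y_true) if t == c]
        counts.append(len(idx))
        if idx:
            correct = sum(1 for i in idx if y_pred[i] == c)
            accuracies.append(correct / len(idx))
        else:
            accuracies.append(0.0)  # no samples of this class
    return accuracies, counts


# Illustrative labels only: six samples of class 0, two of class 1
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]

accs, counts = per_class_accuracy(y_true, y_pred, n_classes=2)

# Macro average: every class counts equally
macro = sum(accs) / len(accs)
# Weighted average: every sample counts equally (this equals plain accuracy)
weighted = sum(a * n for a, n in zip(accs, counts)) / sum(counts)

print(f"per-class: {accs}, macro: {macro:.3f}, weighted: {weighted:.3f}")
```

On a 1:1 balanced test set the two averages coincide, so a gap between them is itself a hint that the evaluated split is not balanced.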

In [None]:
# Visualize DataCollection prediction accuracy by class
print("📊 Analyzing DataCollection prediction accuracy by class...")
print("=" * 60)

# Calculate per-class accuracy
class_accuracies = []
class_counts = []

for i, class_name in enumerate(class_names):
    class_mask = true_labels == i
    class_predictions = predicted_classes[class_mask]
    class_true = true_labels[class_mask]
    
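    if len(class_true) > 0:
        # Fraction of this class's samples predicted correctly (equivalent to class recall).
        # Named `class_accuracy` so the overall `accuracy` variable is not overwritten.
        class_accuracy = np.mean(class_predictions == class_true)
        class_accuracies.append(class_accuracy)
        class_counts.append(len(class_true))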
    else:
        class_accuracies.append(0.0)
        class_counts.append(0)

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Accuracy by class
colors = ['#FF6B6B', '#4ECDC4']
bars = ax1.bar(class_names, class_accuracies, color=colors, alpha=0.8, edgecolor='black')
ax1.set_title('DataCollection Per-Class Accuracy', fontsize=14, fontweight='bold')
ax1.set_ylabel('Accuracy')
ax1.set_ylim(0, 1)
ax1.grid(True, alpha=0.3)

# Add value labels on bars
for bar, acc in zip(bars, class_accuracies):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')

# Plot 2: Sample distribution
ax2.bar(class_names, class_counts, color=colors, alpha=0.8, edgecolor='black')
ax2.set_title('DataCollection Test Sample Distribution', fontsize=14, fontweight='bold')
ax2.set_ylabel('Number of Samples')
ax2.grid(True, alpha=0.3)

# Add value labels on bars
for i, count in enumerate(class_counts):
    ax2.text(i, count + 10, str(count), ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.suptitle('DataCollection Test Results Analysis', fontsize=16, fontweight='bold', y=1.02)
plt.show()

# Performance summary
print(f"\n📋 DataCollection Class Performance Summary:")
print("=" * 50)
for i, class_name in enumerate(class_names):
    print(f"{class_name:>10}: {class_accuracies[i]:.4f} accuracy ({class_counts[i]} samples)")

# Overall statistics
# Unweighted (macro) mean of the per-class accuracies
overall_accuracy = np.mean(class_accuracies)
# Sample-weighted mean, which equals plain accuracy over all test samples
weighted_accuracy = np.average(class_accuracies, weights=class_counts)

print(f"\nOverall Statistics:")
print(f"Macro-average accuracy: {overall_accuracy:.4f}")
print(f"Weighted accuracy: {weighted_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")

# Balance analysis
print(f"\n⚖️  DataCollection Balance Analysis:")
print("=" * 40)
total_samples = sum(class_counts)
for i, class_name in enumerate(class_names):
    percentage = (class_counts[i] / total_samples) * 100
    print(f"{class_name}: {class_counts[i]}/{total_samples} ({percentage:.1f}%)")

# Check if dataset is balanced
balance_ratio = min(class_counts) / max(class_counts) if max(class_counts) > 0 else 0
print(f"Balance ratio: {balance_ratio:.3f}")

if balance_ratio > 0.8:
    print("✅ Well-balanced dataset")
elif balance_ratio > 0.5:
    print("⚠️  Moderately balanced dataset")
else:
    print("❌ Imbalanced dataset")

# Performance assessment
print(f"\n🎯 DataCollection Performance Assessment:")
print("=" * 45)
min_accuracy = min(class_accuracies)
max_accuracy = max(class_accuracies)
accuracy_gap = max_accuracy - min_accuracy

print(f"Best performing class: {class_names[np.argmax(class_accuracies)]} ({max_accuracy:.4f})")
print(f"Worst performing class: {class_names[np.argmin(class_accuracies)]} ({min_accuracy:.4f})")
print(f"Performance gap: {accuracy_gap:.4f}")

if accuracy_gap < 0.1:
    print("✅ Consistent performance across classes")
elif accuracy_gap < 0.2:
    print("⚠️  Moderate performance variation")
else:
    print("❌ High performance variation between classes")

print(f"\n🎉 DataCollection class analysis complete!")
print(f"📊 Model achieves {overall_accuracy:.4f} macro-average accuracy across {len(class_names)} classes")

## 15. Training Summary

**Training completed with DataCollection balanced data:**

- ✅ **Best model automatically saved** with lowest validation loss during training
- 🎯 **ModelCheckpoint callback** ensured optimal model preservation
- 📊 **Balanced dataset** provided stable training foundation
- ⚖️ **1:1 class ratio** eliminated need for class weights
- ✅ **Clean data splits** ensured reliable validation metrics

**Key Training Features:**
- **Early Stopping**: Prevents overfitting with patience=10
- **Learning Rate Reduction**: Adaptive learning with factor=0.5
- **Data Augmentation**: Applied only to training set
- **Validation Monitoring**: Tracks val_loss for best model selection
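
The interplay of these features can be sketched in plain Python. This is an illustration of the logic only, not the Keras callback implementation: the function name, the `lr_patience` value, the starting learning rate, and the loss curve are assumptions made for the example. Only `patience=10` and `factor=0.5` come from the list above.

```python
def simulate_training_control(val_losses, patience=10, lr=1e-3,
                              lr_factor=0.5, lr_patience=5):
    """Replay a validation-loss curve and report where training would stop."""
    best_loss, best_epoch = float("inf"), 0
    stop_wait = 0  # epochs without improvement, for early stopping
    lr_wait = 0    # epochs without improvement since the last LR cut
    stopped_epoch = len(val_losses)

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: this is where a checkpoint would be saved
            best_loss, best_epoch = loss, epoch
            stop_wait = lr_wait = 0
            continue

        stop_wait += 1
        lr_wait += 1
        if lr_wait >= lr_patience:
            lr *= lr_factor  # learning-rate reduction on plateau
            lr_wait = 0
        if stop_wait >= patience:
            stopped_epoch = epoch  # early stopping triggers
            break

    return {"stopped_epoch": stopped_epoch, "best_epoch": best_epoch,
            "best_loss": best_loss, "final_lr": lr}


# Illustrative loss curve with short patience values so the effect is visible
result = simulate_training_control(
    [0.90, 0.70, 0.60, 0.65, 0.64, 0.66, 0.63],
    patience=3, lr_patience=2,
)
print(result)
```

In this toy run the best epoch (3) is earlier than the epoch at which training stops (6), which is why the saved checkpoint, not the final weights, is the model to evaluate.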

## 16. STANDALONE MODEL LOADER 🚀

**Independent model loading and evaluation section**

This section can be run independently to load the best saved model and perform comprehensive evaluation without retraining. It is useful for:
- 📊 **Dashboard Integration**: Load model for real-time predictions
- 🔍 **Model Analysis**: Evaluate performance without training
- 📈 **Results Export**: Generate metrics and visualizations
- 🚀 **Production Testing**: Validate model before deployment

**Use this cell to quickly load your best pre-trained model without running the full training pipeline.**

This cell loads the best model saved during training, selected by validation loss. The ModelCheckpoint callback keeps only the checkpoint with the lowest validation loss seen so far. The evaluation cells that follow still need `test_ds` and `class_names` from the data-loading cells.

In [1]:
# =============================================================================
# STANDALONE MODEL LOADING CELL
# =============================================================================
# This cell can be run independently to load your pre-trained model
# Run this cell before any evaluation cells that need the 'model' variable

import os
import tensorflow as tf
from tensorflow import keras
import numpy as np

print("🔄 Loading Pre-trained Brain Tumor Classification Model...")
print("=" * 60)

# Check current working directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

# Ensure we're in the correct directory (skipped if the hard-coded path does not exist)
project_root = '/workspaces/brain-tumor-classification'
if not current_dir.endswith('brain-tumor-classification') and os.path.isdir(project_root):
    os.chdir(project_root)
    print(f"Changed to project root: {project_root}")

# Model file path
model_path = 'best_brain_tumor_model.keras'

try:
    # Check if model file exists
    if not os.path.exists(model_path):
        print(f"❌ Model file not found: {model_path}")
        print("Available .keras files in current directory:")
        for file in os.listdir('.'):
            if file.endswith('.keras'):
                print(f"  - {file}")
        raise FileNotFoundError(f"Model file {model_path} not found")

    # Load the model
    print(f"📥 Loading model from: {model_path}")
    model = tf.keras.models.load_model(model_path)

    # Verify model loaded successfully
    print("✅ Model loaded successfully!")

    # Display model information
    print(f"\n📋 Model Information:")
    print(f"  Model type: {type(model).__name__}")
    print(f"  Input shape: {model.input_shape}")
    print(f"  Output shape: {model.output_shape}")
    print(f"  Total parameters: {model.count_params():,}")
    # Size estimate assumes 4 bytes (float32) per parameter
    print(f"  Model size: ~{model.count_params() * 4 / 1024 / 1024:.1f} MB")

    # Test model with a dummy input to ensure it's working
    print(f"\n🧪 Testing model functionality...")
    dummy_input = np.random.random((1, 224, 224, 3)).astype(np.float32)
    test_prediction = model.predict(dummy_input, verbose=0)
    print(f"✅ Model test successful - prediction shape: {test_prediction.shape}")
    print(f"  Sample prediction: {test_prediction[0][0]:.4f}")

    print(f"\n🎉 Model is ready for evaluation!")
    print(f"✅ Variable 'model' is now available for evaluation cells")

except Exception as e:
    print(f"❌ Error loading model: {str(e)}")
    print(f"\nTroubleshooting steps:")
    print(f"1. Check if the model file exists: {model_path}")
    print(f"2. Verify you're in the correct directory: {project_root}")
    print(f"3. Ensure the model was saved properly during training")
    print(f"4. Check if you have the correct TensorFlow version")

    # Set model to None to avoid confusion
    model = None
    # Bare `raise` re-raises the original exception with its traceback intact
    raise

print("=" * 60)

2025-07-16 17:14:50.156722: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-07-16 17:14:50.157408: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-07-16 17:14:50.160775: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-07-16 17:14:50.169583: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752686090.190321  103731 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752686090.19

🔄 Loading Pre-trained Brain Tumor Classification Model...
Current directory: /workspaces/brain-tumor-classification/jupyter_notebooks
Changed to project root: /workspaces/brain-tumor-classification
📥 Loading model from: best_brain_tumor_model.keras
✅ Model loaded successfully!

📋 Model Information:
  Model type: Sequential
  Input shape: (None, 224, 224, 3)
  Output shape: (None, 1)
  Total parameters: 23,649
  Model size: ~0.1 MB

🧪 Testing model functionality...


2025-07-16 17:14:52.965340: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


✅ Model test successful - prediction shape: (1, 1)
  Sample prediction: 0.1580

🎉 Model is ready for evaluation!
✅ Variable 'model' is now available for evaluation cells


In [None]:
# Quick check if model exists
try:
    print(f"Model loaded: {type(model).__name__}")
    print(f"Model parameters: {model.count_params():,}")
    print("✅ Model is ready for evaluation")
except NameError:
    print("❌ Model not loaded - run the model loading cell first!")
    print("Run the standalone model loader cell above")

---

## 17. Test Set Evaluation

Evaluate the trained model on the unseen test set to get final performance metrics.

In [2]:
print("🧪 Evaluating model on test set...")
print("Testing with DataCollection balanced authentic data")
print("📊 Test data: Clean balanced splits from DataCollection")
print()

# Evaluate on test set
test_results = model.evaluate(test_ds, verbose=1)

# Extract metrics
test_loss = test_results[0]
test_accuracy = test_results[1]
test_precision = test_results[2] if len(test_results) > 2 else None
test_recall = test_results[3] if len(test_results) > 3 else None

print("\n📊 Final Test Results (DataCollection Balanced Data):")
print("=" * 55)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

if test_precision is not None:
    print(f"Test Precision: {test_precision:.4f}")
if test_recall is not None:
    print(f"Test Recall: {test_recall:.4f}")

# Performance assessment - Updated for DataCollection balanced data
print("\n🎯 Performance Assessment (DataCollection Balanced Data):")
if test_accuracy >= 0.92:  # Higher expectation for balanced data
    print("🎉 Excellent performance on DataCollection balanced data!")
elif test_accuracy >= 0.85:
    print("✅ Good performance on DataCollection balanced data")
elif test_accuracy >= 0.75:
    print("🟡 Moderate performance - may benefit from more training")
else:
    print("⚠️  Performance below expectations for balanced data")

# DataCollection benefits
print(f"\n📈 DataCollection Benefits Realized:")
print(f"✅ Perfectly balanced data quality (1,400 samples per class)")
print(f"✅ Authentic MRI preservation (no augmentation artifacts)")
print(f"✅ Clean train/validation/test splits")
print(f"✅ Reduced overfitting risk with balanced data")

# Count samples from the dataset itself rather than from a label array that may be
# undefined (or hold a single batch) when this cell runs after the standalone loader
n_test = sum(int(labels.shape[0]) for _, labels in test_ds)
# Round instead of truncating so correct + misclassified always equals n_test
n_correct = int(round(test_accuracy * n_test))

print(f"\nDataCollection test set contains {n_test} samples")
print(f"Correctly classified: {n_correct} samples")
print(f"Misclassified: {n_test - n_correct} samples")

🧪 Evaluating model on test set...
Testing with DataCollection balanced authentic data
📊 Test data: Clean balanced splits from DataCollection



NameError: name 'test_ds' is not defined
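
The `NameError` above appears because the loader cell restores only `model`, while `test_ds` is created in the data-loading cells. One way to rebuild it is sketched below. Treat it as a sketch under stated assumptions: the helper name `build_test_dataset`, the class-per-folder directory layout, the batch size, integer labels, and the division by 255 are not taken from the notebook's data pipeline (only the 224×224 input size and the [0,1] normalization are mentioned elsewhere in it), so they should be checked against the original preprocessing.

```python
import os


def build_test_dataset(test_dir, img_size=(224, 224), batch_size=32):
    """Rebuild an unshuffled, normalized test dataset from a class-per-folder directory.

    Hypothetical helper: layout, batch size, label mode and scaling are assumptions.
    """
    if not os.path.isdir(test_dir):
        raise FileNotFoundError(f"Test directory not found: {test_dir}")

    # Imported lazily so a wrong path fails fast, before TensorFlow starts up
    import tensorflow as tf

    ds = tf.keras.utils.image_dataset_from_directory(
        test_dir,
        labels="inferred",
        label_mode="int",
        image_size=img_size,
        batch_size=batch_size,
        shuffle=False,  # keep a fixed order so labels and predictions stay aligned
    )
    class_names = ds.class_names  # read before map(), which drops the attribute

    # Assumed normalization to [0, 1]; skip this if the model rescales internally
    ds = ds.map(lambda images, labels: (images / 255.0, labels))
    return ds, class_names
```

A call would look like `test_ds, class_names = build_test_dataset(path_to_test_split)`, where `path_to_test_split` is a placeholder for wherever the test images live in the project.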

## 18. Generate Predictions and Confidence Scores

Generate predictions for all test samples with confidence scores.
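
Two related quantities appear in this notebook: the raw sigmoid output (the predicted tumor probability) and a confidence score defined as the distance from the 0.5 decision boundary, rescaled to [0, 1]. The small sketch below, with hypothetical function names and made-up probabilities, shows how the two relate.

```python
def confidence_from_probability(tumor_prob):
    """Map a sigmoid output in [0, 1] to a confidence in [0, 1].

    0.0 means the model sits exactly on the decision boundary,
    1.0 means it is fully certain of either class.
    """
    return abs(tumor_prob - 0.5) * 2


def predicted_class(tumor_prob, threshold=0.5):
    """Return 1 (tumor) if the probability exceeds the threshold, else 0."""
    return int(tumor_prob > threshold)


# Illustrative probabilities only
for p in (0.05, 0.50, 0.62, 0.97):
    print(f"prob={p:.2f}  class={predicted_class(p)}  "
          f"confidence={confidence_from_probability(p):.2f}")
```

A probability of 0.05 and one of 0.95 therefore carry the same confidence (0.90) for opposite classes, which is why averaging raw probabilities says little about how certain the model is.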

In [None]:
print("🔮 Generating predictions and confidence scores...")

# Collect predictions and true labels in a single pass over the test set.
# Iterating the dataset once keeps each prediction paired with its own label,
# even if the dataset reshuffles between iterations.
print("Generating predictions and extracting true labels...")
y_true_batches = []
y_prob_batches = []
for images, labels in test_ds:
    y_prob_batches.append(model.predict_on_batch(images).reshape(-1))
    y_true_batches.append(labels.numpy().reshape(-1))

y_true = np.concatenate(y_true_batches).astype(int)
y_pred_probs = np.concatenate(y_prob_batches)

# Sanity check: both arrays must describe the same samples
print(f"True labels length: {len(y_true)}")
print(f"Predictions length: {len(y_pred_probs)}")
assert len(y_true) == len(y_pred_probs), "Label/prediction count mismatch"

# Convert probabilities to binary predictions (threshold = 0.5)
y_pred = (y_pred_probs > 0.5).astype(int)

print(f"\n📊 Prediction Summary:")
print(f"Total test samples: {len(y_true)}")
print(f"Predicted as Tumor: {np.sum(y_pred)} samples")
print(f"Predicted as No Tumor: {np.sum(1 - y_pred)} samples")

# Analyze the distribution of predicted tumor probabilities (raw sigmoid outputs)
print(f"\n🔍 Predicted Probability Analysis:")
print(f"Mean tumor probability: {np.mean(y_pred_probs):.3f}")
print(f"Tumor probability std: {np.std(y_pred_probs):.3f}")
print(f"Min tumor probability: {np.min(y_pred_probs):.3f}")
print(f"Max tumor probability: {np.max(y_pred_probs):.3f}")

# Count high/low confidence predictions (probabilities far from / close to 0.5)
high_confidence = np.sum((y_pred_probs > 0.8) | (y_pred_probs < 0.2))
low_confidence = np.sum((y_pred_probs >= 0.4) & (y_pred_probs <= 0.6))

print(f"\nConfidence Distribution:")
print(f"High confidence (>0.8 or <0.2): {high_confidence} ({high_confidence/len(y_true)*100:.1f}%)")
print(f"Low confidence (0.4-0.6): {low_confidence} ({low_confidence/len(y_true)*100:.1f}%)")

# Visualize confidence distribution
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.hist(y_pred_probs, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(x=0.5, color='red', linestyle='--', label='Decision Threshold')
plt.xlabel('Predicted Tumor Probability')
plt.ylabel('Frequency')
plt.title('Distribution of Predicted Tumor Probabilities')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Separate by true labels
tumor_probs = y_pred_probs[y_true == 1]
no_tumor_probs = y_pred_probs[y_true == 0]

plt.hist(no_tumor_probs, bins=15, alpha=0.7, label='No Tumor (True)', color='lightgreen')
plt.hist(tumor_probs, bins=15, alpha=0.7, label='Tumor (True)', color='lightcoral')
plt.axvline(x=0.5, color='red', linestyle='--', label='Decision Threshold')
plt.xlabel('Predicted Tumor Probability')
plt.ylabel('Frequency')
plt.title('Predicted Tumor Probabilities by True Label')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Predictions generated successfully!")

## 19. Precision-Recall Analysis and Optimal Threshold

Analyze precision-recall trade-offs and find optimal classification threshold.
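
The cell that follows picks the threshold that maximizes F1 along the precision-recall curve. The idea can be sketched without scikit-learn as a sweep over candidate thresholds. The function names and the six labelled scores are illustrative assumptions. The `>=` comparison mirrors the convention of `precision_recall_curve`, which treats scores at or above a threshold as positive. In practice such a threshold is usually tuned on a validation split, since tuning it on the test set makes the reported test metrics optimistic.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when scores >= threshold are predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= threshold)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, y_prob):
    """Return (best_f1, threshold), trying every distinct score as a threshold."""
    candidates = sorted(set(y_prob))
    scored = [(f1_at_threshold(y_true, y_prob, t), t) for t in candidates]
    # max() keeps the first maximum, i.e. the lowest threshold among ties
    return max(scored, key=lambda pair: pair[0])


# Illustrative labels and sigmoid outputs only
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]

best_f1, best_t = best_f1_threshold(y_true, y_prob)
default_f1 = f1_at_threshold(y_true, y_prob, 0.5)
print(f"best threshold={best_t:.2f}  F1={best_f1:.3f}  (F1 at 0.5: {default_f1:.3f})")
```

In this toy data, lowering the threshold to 0.40 recovers one tumor case that 0.5 misses, at the cost of one false positive.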

In [None]:
from sklearn.metrics import precision_recall_curve, f1_score

print("📈 Computing precision-recall curve and optimal threshold...")

# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_pred_probs)

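# Calculate F1 scores for each threshold (0/0 is mapped to 0 without a runtime warning)
denominator = precision + recall
f1_scores = np.divide(2 * precision * recall, denominator,
                      out=np.zeros_like(denominator), where=denominator > 0)

# Find threshold that maximizes F1 score. precision/recall hold one more entry than
# thresholds (the final point has no threshold), so that last entry is excluded.
optimal_idx = np.argmax(f1_scores[:-1])
optimal_threshold = thresholds[optimal_idx]
optimal_precision = precision[optimal_idx]
optimal_recall = recall[optimal_idx]
optimal_f1 = f1_scores[optimal_idx]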

print(f"\n🎯 Optimal Threshold Analysis:")
print("=" * 50)
print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"F1 score at optimal threshold: {optimal_f1:.3f}")
print(f"Precision at optimal threshold: {optimal_precision:.3f}")
print(f"Recall at optimal threshold: {optimal_recall:.3f}")

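# Compare with default threshold (0.5)
default_f1 = f1_score(y_true, y_pred)
# Guard against division by zero when the default threshold yields F1 = 0
improvement_pct = (optimal_f1 - default_f1) / default_f1 * 100 if default_f1 > 0 else float('nan')
print(f"\nComparison with default threshold (0.5):")
print(f"Default F1 score: {default_f1:.3f}")
print(f"Optimal F1 score: {optimal_f1:.3f}")
print(f"Improvement: {improvement_pct:+.1f}%")

# Generate predictions with optimal threshold.
# precision_recall_curve counts scores >= threshold as positive, so >= is used here
# to reproduce the precision and recall reported above.
y_pred_optimal = (y_pred_probs >= optimal_threshold).astype(int)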

# Plot precision-recall curve
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(recall, precision, 'b-', linewidth=2, label='Precision-Recall curve')
plt.scatter(optimal_recall, optimal_precision, c='red', s=100, zorder=5, 
           label=f'Optimal threshold = {optimal_threshold:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xlim([0, 1])
plt.ylim([0, 1])

# Plot F1 scores vs thresholds
plt.subplot(1, 2, 2)
plt.plot(thresholds, f1_scores[:-1], 'g-', linewidth=2, label='F1 Score')
plt.axvline(x=optimal_threshold, color='red', linestyle='--', 
           label=f'Optimal = {optimal_threshold:.3f}')
plt.axvline(x=0.5, color='blue', linestyle='--', alpha=0.7, label='Default = 0.5')
plt.xlabel('Threshold')
plt.ylabel('F1 Score')
plt.title('F1 Score vs Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n✅ Optimal threshold found: {optimal_threshold:.3f}")
print(f"📊 Using optimal threshold improves F1 score by {((optimal_f1 - default_f1) / default_f1 * 100):+.1f}%")

## 20. Confusion Matrix Analysis

Detailed analysis of the confusion matrix for model performance understanding.
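
The confusion-matrix cell that follows unpacks a 2×2 matrix in the order `tn, fp, fn, tp`, which matches scikit-learn's layout of true classes in rows and predicted classes in columns. The sketch below derives the same rates from a plain nested list. The function name and the counts are made up for illustration and are not results from this model.

```python
def binary_metrics(cm):
    """Derive rates from a 2x2 confusion matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # recall for the tumor class
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # recall for the no-tumor class
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,          # equals 1 - specificity
    }


# Illustrative counts only: rows are true classes, columns are predicted classes
example_cm = [[90, 10],
              [5, 95]]

metrics = binary_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name:<12} {value:.3f}")
```

For tumor screening, sensitivity is usually the rate to watch first, because a false negative (a missed tumor) tends to be costlier than a false positive.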

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

print("📊 Computing confusion matrices...")

# Compute confusion matrices for both thresholds.
# labels=[0, 1] guarantees a 2x2 matrix even if one class is never predicted or absent.
cm_default = confusion_matrix(y_true, y_pred, labels=[0, 1])
cm_optimal = confusion_matrix(y_true, y_pred_optimal, labels=[0, 1])

# Display confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Default threshold confusion matrix
disp1 = ConfusionMatrixDisplay(confusion_matrix=cm_default, display_labels=class_names)
disp1.plot(ax=axes[0], cmap='Blues', values_format='d')
axes[0].set_title('Confusion Matrix (Threshold = 0.5)')

# Optimal threshold confusion matrix
disp2 = ConfusionMatrixDisplay(confusion_matrix=cm_optimal, display_labels=class_names)
disp2.plot(ax=axes[1], cmap='Greens', values_format='d')
axes[1].set_title(f'Confusion Matrix (Threshold = {optimal_threshold:.3f})')

plt.tight_layout()
plt.show()

# Detailed analysis for both thresholds
def analyze_confusion_matrix(cm, threshold_name, threshold_value):
    tn, fp, fn, tp = cm.ravel()
    
    print(f"\n📋 {threshold_name} (threshold = {threshold_value:.3f}):")
    print("-" * 50)
    print(f"True Negatives:  {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives:  {tp}")
    
    # Calculate rates
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0  # Recall
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    
    print(f"\nPerformance Metrics:")
    print(f"Accuracy:    {accuracy:.3f}")
    print(f"Sensitivity: {sensitivity:.3f} (Recall)")
    print(f"Specificity: {specificity:.3f}")
    print(f"Precision:   {precision:.3f}")
    
    # False positive rate
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    print(f"False Positive Rate: {fpr:.3f}")
    
    return {
        'accuracy': accuracy,
        'sensitivity': sensitivity,
        'specificity': specificity,
        'precision': precision,
        'fpr': fpr
    }

# Analyze both confusion matrices
metrics_default = analyze_confusion_matrix(cm_default, "Default Threshold", 0.5)
metrics_optimal = analyze_confusion_matrix(cm_optimal, "Optimal Threshold", optimal_threshold)

# Comparison
print(f"\n🔄 Threshold Comparison:")
print("=" * 50)
print(f"Accuracy improvement:    {(metrics_optimal['accuracy'] - metrics_default['accuracy']):+.3f}")
print(f"Sensitivity improvement: {(metrics_optimal['sensitivity'] - metrics_default['sensitivity']):+.3f}")
print(f"Specificity improvement: {(metrics_optimal['specificity'] - metrics_default['specificity']):+.3f}")
print(f"Precision improvement:   {(metrics_optimal['precision'] - metrics_default['precision']):+.3f}")
print(f"FPR improvement:         {(metrics_default['fpr'] - metrics_optimal['fpr']):+.3f}")

# Classification report with optimal threshold
print(f"\n📋 Detailed Classification Report (Optimal Threshold):")
print("=" * 60)
print(classification_report(y_true, y_pred_optimal, target_names=class_names, digits=3))

## 21. Save Results and Artifacts

Save predictions, evaluation metrics, confusion matrices, and training history for dashboard integration.
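
The saving cell that follows writes several dictionaries with `json.dump`. Values produced by NumPy, such as the integer returned by `np.argmin`, are not built-in Python types, and `json` raises a `TypeError` for NumPy integers and arrays. Casting each value explicitly is one option. Another, sketched here with a hypothetical helper name, is a `default=` hook that converts anything exposing `.tolist()`, which NumPy scalars and arrays both do.

```python
import json


def json_default(obj):
    """Fallback for json.dump/json.dumps: convert NumPy-style values to built-ins.

    NumPy scalars and arrays both provide .tolist(), which returns plain
    Python numbers or (nested) lists.
    """
    if hasattr(obj, "tolist"):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# Usage with the dictionaries built in the saving cell (shown as comments because
# `evaluation_metrics` is only defined there):
#
#     with open('evaluation_metrics.json', 'w') as f:
#         json.dump(evaluation_metrics, f, indent=2, default=json_default)
```

The hook is only consulted for objects `json` cannot handle itself, so ordinary strings, numbers, lists, and dictionaries are unaffected.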

In [None]:
import pandas as pd
import json

print("💾 Saving final results and artifacts...")
print("For DataCollection balanced authentic data approach")
print("=" * 60)

# 1. Save predictions and confidence scores
results_df = pd.DataFrame({
    'true_label': y_true,
    'predicted_label_default': y_pred,
    'predicted_label_optimal': y_pred_optimal,
    'confidence_score': y_pred_probs,
    'optimal_threshold': optimal_threshold
})

results_df.to_csv("test_predictions.csv", index=False)
print("✅ Predictions saved to: test_predictions.csv")

# 2. Save evaluation metrics - Updated for DataCollection
evaluation_metrics = {
    'model_architecture': 'Custom CNN (16→32→64) for DataCollection Balanced Data',
    'total_parameters': int(model.count_params()),
    'data_approach': 'DataCollection balanced sampling (1,400 per class)',
    'data_source': 'DataCollection balanced authentic MRI data',
    'data_quality': 'Perfectly balanced, no augmentation',
    'optimal_threshold': float(optimal_threshold),

    # Test set metrics
    'test_loss': float(test_loss),
    'test_accuracy': float(test_accuracy),
    # `is not None` keeps a legitimate value of 0.0 instead of turning it into None
    'test_precision': float(test_precision) if test_precision is not None else None,
    'test_recall': float(test_recall) if test_recall is not None else None,
    
    # Optimal threshold metrics
    'optimal_accuracy': float(metrics_optimal['accuracy']),
    'optimal_sensitivity': float(metrics_optimal['sensitivity']),
    'optimal_specificity': float(metrics_optimal['specificity']),
    'optimal_precision': float(metrics_optimal['precision']),
    'optimal_f1_score': float(optimal_f1),
    'optimal_fpr': float(metrics_optimal['fpr']),
    
    # Default threshold metrics for comparison
    'default_accuracy': float(metrics_default['accuracy']),
    'default_f1_score': float(default_f1),
    
    # DataCollection training info
    'batch_size': BATCH_SIZE,
    'image_size': IMG_SIZE,
    'class_names': class_names,
    'total_test_samples': int(len(y_true)),
    'training_samples': int(len(train_labels)),
    'validation_samples': int(len(val_labels)),
    'class_balance_ratio': float(min(counts) / max(counts)),
    'datacollection_balance': 'Perfect 1:1 ratio (1,400 per class)',
    'preprocessing': 'Single normalization to [0,1], DataCollection PNG format',
    'splits': 'Clean train/val/test from DataCollection'
}

# Add dynamic training information if available
if 'history' in locals() and hasattr(history, 'history'):
    val_loss = history.history['val_loss']
    # np.argmin returns a NumPy integer, which json.dump cannot serialize; cast to int
    best_epoch = int(np.argmin(val_loss)) + 1
    best_val_loss = min(val_loss)
    
    evaluation_metrics.update({
        'training_epochs_completed': len(history.history['loss']),
        'best_epoch': best_epoch,
        'best_validation_loss': float(best_val_loss),
        'final_training_loss': float(history.history['loss'][-1]),
        'final_validation_loss': float(val_loss[-1])
    })
    
    # Add best epoch validation accuracy if available
    if 'val_accuracy' in history.history:
        best_val_acc = history.history['val_accuracy'][best_epoch - 1]
        evaluation_metrics['best_validation_accuracy'] = float(best_val_acc)
else:
    evaluation_metrics.update({
        'training_note': 'Model loaded from checkpoint - training history not available'
    })

with open('evaluation_metrics.json', 'w') as f:
    json.dump(evaluation_metrics, f, indent=2)
print("✅ Evaluation metrics saved to: evaluation_metrics.json")

# 3. Save confusion matrices
confusion_matrices = {
    'default_threshold': {
        'threshold': 0.5,
        'matrix': cm_default.tolist(),
        'metrics': metrics_default
    },
    'optimal_threshold': {
        'threshold': float(optimal_threshold),
        'matrix': cm_optimal.tolist(),
        'metrics': metrics_optimal
    },
    'datacollection_info': {
        'data_source': 'DataCollection balanced sampling',
        'data_quality': 'Perfectly balanced authentic MRI (1,400 per class)',
        'class_balance': 'Perfect 1:1 ratio from DataCollection',
        'splits': 'Clean train/validation/test splits'
    }
}

with open('confusion_matrices.json', 'w') as f:
    json.dump(confusion_matrices, f, indent=2)
print("✅ Confusion matrices saved to: confusion_matrices.json")

# 4. Save training history (if available)
if 'history' in locals() and hasattr(history, 'history'):
    # Extract training metrics
    train_loss = history.history['loss']
    val_loss = history.history['val_loss']
    epochs_completed = len(train_loss)
    
    # Find best epoch dynamically (cast to a built-in int so json.dump can serialize it)
    best_epoch = int(np.argmin(val_loss)) + 1
    best_val_loss = min(val_loss)
    
    training_history = {
        'datacollection_approach': 'Balanced authentic MRI data',
        'training_info': {
            'epochs_completed': epochs_completed,
            'best_epoch': best_epoch,
            'best_validation_loss': float(best_val_loss),
            'data_quality': 'DataCollection balanced authentic MRI',
            'class_balance': 'Perfect 1:1 ratio (1,400 per class)',
            'splits': 'Clean DataCollection train/val/test'
        }
    }
    
    # Add all training metrics
    for key, values in history.history.items():
        training_history[key] = [float(v) for v in values]
    
    # Add best epoch metrics
    for key, values in history.history.items():
        if key.startswith('val_'):
            training_history[f'best_{key}'] = float(values[best_epoch - 1])
    
    with open('training_history.json', 'w') as f:
        json.dump(training_history, f, indent=2)
    print("✅ Training history saved to: training_history.json")
    print(f"   Best epoch: {best_epoch} (validation loss: {best_val_loss:.4f})")
else:
    print("⚠️  Training history not available (model was loaded from checkpoint)")
    print("   Training metrics not saved")

# 5. Summary of saved files
print(f"\n📁 DataCollection Results Summary:")
print("=" * 60)
files_info = [
    ("best_brain_tumor_model.keras", "Best model (DataCollection balanced data)"),
    ("test_predictions.csv", f"Test predictions ({len(results_df)} samples)"),
    ("evaluation_metrics.json", "Performance metrics"),
    ("confusion_matrices.json", "Confusion matrix data"),
    ("training_history.json", "Training history (if available)")
]

for filename, description in files_info:
    if os.path.exists(filename):
        size_mb = os.path.getsize(filename) / (1024 * 1024)
        print(f"✅ {filename:<30} - {description} ({size_mb:.1f} MB)")
    else:
        print(f"❌ {filename:<30} - {description} (NOT FOUND)")

print(f"\n🎉 DataCollection Training & Evaluation Completed!")
print("=" * 60)
print(f"📊 Final Test Accuracy: {test_accuracy:.3f} ({test_accuracy*100:.1f}%)")
print(f"🎯 Optimal F1 Score: {optimal_f1:.3f}")
print(f"🔬 Data Source: DataCollection balanced sampling")
print(f"⚖️  Class Balance: Perfect 1:1 ratio (1,400 per class)")

# Add dynamic training info if available
if 'history' in locals() and hasattr(history, 'history'):
    val_loss = history.history['val_loss']
    best_epoch = np.argmin(val_loss) + 1
    best_val_loss = min(val_loss)
    print(f"🏆 Best Model: Epoch {best_epoch} (val_loss: {best_val_loss:.4f})")
    print(f"📈 Training Epochs: {len(history.history['loss'])}")

print(f"📈 Ready for dashboard integration!")
print(f"✅ All artifacts saved with DataCollection metadata")