# Wisconsin Breast Cancer Classification

## Introduction

This notebook classifies breast tumors as malignant or benign using features computed from digitized images of cell nuclei, such as radius, texture, and concavity. This is a binary classification problem where:

- Malignant: Cancerous tumor
- Benign: Non-cancerous tumor

We'll explore the dataset, preprocess the data, and implement multiple machine learning models to solve this classification problem.

## Data Loading and Understanding

First, we'll import the necessary libraries and load the Wisconsin Breast Cancer dataset, which is available in scikit-learn.

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set style for plots
plt.style.use('ggplot')
sns.set_theme(style="whitegrid")

# Load the Wisconsin Breast Cancer dataset
data = load_breast_cancer()

# Create a pandas DataFrame
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Check the shape of the data
print(f"Dataset shape: {X.shape}")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")

# Display target distribution
target_counts = pd.Series(data.target).value_counts()
print(f"\nTarget distribution:\n{target_counts}")
print(f"Target names: {data.target_names}")

# Preview the first few rows of the dataset
X.head()
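
Before interpreting any plot or metric, it helps to confirm how the two classes are encoded. The sketch below reloads the dataset so that it runs on its own; `label_map` and `class_counts` are names introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Map each integer label to its class name
# (the position in target_names is the label value)
label_map = {i: str(name) for i, name in enumerate(data.target_names)}
print(label_map)

# Count the samples in each class, keyed by class name
class_counts = {
    str(name): int(count)
    for name, count in zip(data.target_names, np.bincount(data.target))
}
print(class_counts)
```

In scikit-learn's copy of the dataset, 0 is malignant and 1 is benign, so the class that scikit-learn metrics treat as "positive" by default (label 1) is benign.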

### Understanding the Features

Let's explore the dataset's features and see how they relate to the target variable. We'll first look at basic statistical information and then visualize feature distributions.

In [None]:
# Basic statistics of the features
X.describe()

In [None]:
# Create a DataFrame combining features and target for analysis
df = pd.concat([X, y], axis=1)

# Check for missing values
print("Missing values in dataset:")
print(df.isnull().sum().sum())

In [None]:
# Visualize correlations between features
plt.figure(figsize=(16, 14))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=False, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Matrix')
plt.show()

In [None]:
# Visualize correlation with target
plt.figure(figsize=(10, 8))
# Sort correlations with target
target_corr = correlation_matrix['target'].sort_values(ascending=False)
# Exclude target self-correlation
target_corr = target_corr[target_corr.index != 'target']
# Plot top correlations
sns.barplot(x=target_corr.values, y=target_corr.index)
plt.title('Feature Correlations with Target')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

### Feature Distribution by Target Class

Let's visualize how the distributions of key features differ between malignant and benign tumors.

In [None]:
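# Select the five features most strongly correlated with the target.
# Most correlations here are negative (malignant is encoded as 0), so rank by absolute value.
top_features = target_corr.abs().sort_values(ascending=False).index[:5]

# Plot per-class histograms for each selected feature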
plt.figure(figsize=(15, 10))
for i, feature in enumerate(top_features):
    plt.subplot(2, 3, i+1)
    sns.histplot(data=df, x=feature, hue='target', kde=True, bins=30)
    plt.title(f"{feature} Distribution by Class")

plt.tight_layout()
plt.show()

In [None]:
# Create a pairplot for the top correlated features
subset_df = df[list(top_features) + ['target']]
sns.pairplot(subset_df, hue='target', diag_kind='kde')
plt.suptitle('Pairplot of Top Features', y=1.02)
plt.show()

## Data Preprocessing

Now that we understand our data, let's prepare it for modeling. We need to:
1. Split the data into training and testing sets
2. Scale the features to standardize their ranges

In [None]:
# Use the raw NumPy arrays for modeling (this rebinds X and y, which previously held the DataFrame and Series)
X = data.data
y = data.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check the shapes of our splits
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

In [None]:
# Scale the features
scaler = StandardScaler()

# Fit on training data
X_train_scaled = scaler.fit_transform(X_train)

# Apply same transformation to test data
X_test_scaled = scaler.transform(X_test)

# Check the mean and standard deviation of the scaled data
print("Training data mean after scaling:", X_train_scaled.mean(axis=0)[:5], "...")
print("Training data std after scaling:", X_train_scaled.std(axis=0)[:5], "...")
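
To make the scaling step concrete, here is a minimal sketch of standardization done by hand on a single toy feature. The values are invented for illustration; the point is that the mean and standard deviation come from the training split only and are then reused for the test split.

```python
import statistics

# Toy values for a single feature (illustrative, not taken from the dataset)
train_values = [2.0, 4.0, 6.0, 8.0]
test_values = [5.0, 10.0]

# Estimate the statistics from the training split only
mu = statistics.fmean(train_values)
sigma = statistics.pstdev(train_values)  # population std (ddof=0), as StandardScaler uses

# Apply the same transformation z = (x - mu) / sigma to both splits
train_scaled = [(x - mu) / sigma for x in train_values]
test_scaled = [(x - mu) / sigma for x in test_values]

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled test values generally do not, and that is expected: the test set is deliberately kept out of the estimation so that no information leaks from it into preprocessing.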

## Model Development: Neural Network

Now we'll build a feedforward neural network using TensorFlow and Keras to classify the tumors.

In [None]:
# Import TensorFlow and Keras libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
import time

# Check TensorFlow version
print("TensorFlow version:", tf.__version__)

In [None]:
# Set random seed for reproducibility
tf.random.set_seed(42)
np.random.seed(42)

# Build a neural network model
def build_nn_model():
    model = Sequential([
        # Input layer (30 features); an explicit Input avoids the
        # input_shape deprecation warning in recent Keras versions
        keras.Input(shape=(30,)),
        Dense(64, activation='relu'),
        Dropout(0.2),  # Regularization to prevent overfitting
        Dense(32, activation='relu'),
        Dropout(0.2),
        Dense(16, activation='relu'),
        # Output layer - binary classification
        Dense(1, activation='sigmoid')
    ])
    
    # Compile the model
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Create the model
nn_model = build_nn_model()

# Display the model summary
nn_model.summary()
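
The parameter counts reported by the summary can be checked by hand. The sketch below assumes the layer widths defined above (30 inputs, then 64, 32, 16, and 1 units); `layer_sizes` and `params_per_layer` are illustrative names.

```python
# Layer widths from the model above: 30 inputs -> 64 -> 32 -> 16 -> 1
layer_sizes = [30, 64, 32, 16, 1]

# A Dense layer has (inputs * units) weights plus one bias per unit.
# Dropout layers add no trainable parameters.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [1984, 2080, 528, 17]
print(total_params)      # 4609
```

With roughly 4,600 trainable parameters and only a few hundred training samples, regularization such as dropout and early stopping is a reasonable precaution.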

In [None]:
# Set up early stopping to prevent overfitting
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=20,
    restore_best_weights=True
)

# Record start time for model training
start_time = time.time()

# Train the model
history = nn_model.fit(
    X_train_scaled,
    y_train,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    callbacks=[early_stopping],
    verbose=1
)

# Calculate training time
nn_training_time = time.time() - start_time
print(f"Neural Network training time: {nn_training_time:.2f} seconds")
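
The `patience` setting can be easier to reason about with a small simulation. The following is a simplified pure-Python model of patience-based stopping, not Keras's actual implementation; the function name `simulate_early_stopping` and the loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses.

    Epochs are 0-indexed. Training stops once the loss has failed to improve
    for `patience` consecutive epochs, or when the list runs out.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: the best value occurs at epoch 2
val_losses = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.64]
stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

With `restore_best_weights=True`, as used above, the model is rolled back to the weights from the best epoch rather than keeping those from the epoch where training stopped.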

### Visualize Training Process

In [None]:
# Plot training history
plt.figure(figsize=(12, 5))

# Plot accuracy
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()

# Plot loss
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()

plt.tight_layout()
plt.show()

## Model Evaluation

Now we'll evaluate our neural network's performance on the test set. We'll look at various metrics including accuracy, precision, recall, and the confusion matrix.

In [None]:
# Import evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Make predictions with the neural network
y_pred_nn_prob = nn_model.predict(X_test_scaled)
y_pred_nn = (y_pred_nn_prob > 0.5).astype(int).flatten()

# Calculate metrics
nn_accuracy = accuracy_score(y_test, y_pred_nn)
nn_precision = precision_score(y_test, y_pred_nn)
nn_recall = recall_score(y_test, y_pred_nn)
nn_f1 = f1_score(y_test, y_pred_nn)

# Print results
print("Neural Network Performance:")
print(f"Accuracy: {nn_accuracy:.4f}")
print(f"Precision: {nn_precision:.4f}")
print(f"Recall: {nn_recall:.4f}")
print(f"F1 Score: {nn_f1:.4f}")

# Print detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred_nn, target_names=['Malignant', 'Benign']))

In [None]:
# Create and visualize confusion matrix
cm = confusion_matrix(y_test, y_pred_nn)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - Neural Network')
plt.show()

### Understanding Evaluation Metrics

- **Accuracy**: The proportion of correct predictions among the total number of cases examined.
  - Formula: (TP + TN) / (TP + TN + FP + FN)

- **Precision**: The proportion of positive identifications that were actually correct.
  - Formula: TP / (TP + FP)
  - In our context: "When the model predicts a tumor is benign, how often is it correct?"

- **Recall (Sensitivity)**: The proportion of actual positives that were identified correctly.
  - Formula: TP / (TP + FN)
  - In our context: "Of all the actual benign tumors, how many did the model identify correctly?"

- **F1 Score**: The harmonic mean of precision and recall, providing a balance between them.
  - Formula: 2 * (Precision * Recall) / (Precision + Recall)

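In a medical context like cancer detection, recall for the malignant class is particularly important, because a missed cancer is the costliest error. Note that scikit-learn encodes malignant as 0 and benign as 1, and `precision_score` and `recall_score` treat label 1 as the positive class by default. The precision and recall reported above therefore describe the benign class, and a missed cancer appears as a false positive (a malignant tumor predicted as benign) rather than a false negative. To measure how many malignant tumors the model catches, compute `recall_score(y_test, y_pred_nn, pos_label=0)`.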
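
To make the formulas above concrete, here is a small worked example that computes each metric from confusion-matrix counts. The counts are hypothetical and are not results from this notebook. The example also computes recall for the malignant class (label 0 in scikit-learn's encoding), which is the quantity that tracks missed cancers.

```python
# Hypothetical confusion-matrix counts (illustrative only).
# Rows are true labels, columns are predicted labels,
# both ordered [malignant (0), benign (1)].
cm = [[40, 2],
      [1, 71]]

# With benign (label 1) as the positive class, scikit-learn's default:
tn, fp = cm[0]  # true malignant: predicted malignant, predicted benign
fn, tp = cm[1]  # true benign: predicted malignant, predicted benign

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Recall for the malignant class: of all truly malignant tumors,
# what fraction did the model flag as malignant?
malignant_recall = tn / (tn + fp)

print(f"Accuracy:           {accuracy:.4f}")
print(f"Precision (benign): {precision:.4f}")
print(f"Recall (benign):    {recall:.4f}")
print(f"F1 (benign):        {f1:.4f}")
print(f"Recall (malignant): {malignant_recall:.4f}")
```

In this made-up example the benign recall is higher than the malignant recall, which shows how the default metrics can look better than the clinically relevant one.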

## Comparison with Simpler Models

Now we'll compare our neural network with simpler models: Decision Tree and Logistic Regression.

In [None]:
# Import models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Initialize models
dt_model = DecisionTreeClassifier(random_state=42)
lr_model = LogisticRegression(random_state=42, max_iter=1000)

# Dictionary to store results
model_results = {}

# Train and evaluate decision tree
start_time = time.time()
dt_model.fit(X_train_scaled, y_train)
dt_training_time = time.time() - start_time
y_pred_dt = dt_model.predict(X_test_scaled)

# Store decision tree results
model_results['Decision Tree'] = {
    'accuracy': accuracy_score(y_test, y_pred_dt),
    'precision': precision_score(y_test, y_pred_dt),
    'recall': recall_score(y_test, y_pred_dt),
    'f1': f1_score(y_test, y_pred_dt),
    'time': dt_training_time
}

# Train and evaluate logistic regression
start_time = time.time()
lr_model.fit(X_train_scaled, y_train)
lr_training_time = time.time() - start_time
y_pred_lr = lr_model.predict(X_test_scaled)

# Store logistic regression results
model_results['Logistic Regression'] = {
    'accuracy': accuracy_score(y_test, y_pred_lr),
    'precision': precision_score(y_test, y_pred_lr),
    'recall': recall_score(y_test, y_pred_lr),
    'f1': f1_score(y_test, y_pred_lr),
    'time': lr_training_time
}

# Add neural network results to dictionary
model_results['Neural Network'] = {
    'accuracy': nn_accuracy,
    'precision': nn_precision,
    'recall': nn_recall,
    'f1': nn_f1,
    'time': nn_training_time
}

# Print results
for model_name, metrics in model_results.items():
    print(f"\n{model_name}:")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Precision: {metrics['precision']:.4f}")
    print(f"Recall: {metrics['recall']:.4f}")
    print(f"F1 Score: {metrics['f1']:.4f}")
    print(f"Training Time: {metrics['time']:.4f} seconds")

In [None]:
# Visualize model comparison

# Create DataFrame for visualization
metrics = ['accuracy', 'precision', 'recall', 'f1']
model_names = list(model_results.keys())
comparison_data = []

for metric in metrics:
    for model in model_names:
        comparison_data.append({
            'Model': model,
            'Metric': metric.capitalize(),
            'Value': model_results[model][metric]
        })
        
comparison_df = pd.DataFrame(comparison_data)

# Plot comparison
plt.figure(figsize=(12, 8))

# Performance metrics comparison
plt.subplot(2, 1, 1)
sns.barplot(x='Metric', y='Value', hue='Model', data=comparison_df)
plt.title('Model Performance Comparison')
plt.ylim(0.9, 1.0)  # Adjust as needed based on actual results
plt.legend(title='Model')

# Training time comparison
plt.subplot(2, 1, 2)
training_times = [model_results[model]['time'] for model in model_names]
plt.bar(model_names, training_times)
plt.title('Model Training Time Comparison')
plt.ylabel('Time (seconds)')
plt.xlabel('Model')

plt.tight_layout()
plt.show()

## Analysis and Conclusion

### Neural Network Effectiveness

The neural network's effectiveness for this problem can be evaluated from several perspectives:

**Strengths of Neural Networks for this problem:**
1. **Complex Pattern Recognition**: Neural networks excel at capturing complex, non-linear relationships between features, which may be present in medical data.
2. **Feature Learning**: The hidden layers can automatically learn useful intermediate representations of the data.
3. **Robustness**: With regularization techniques like dropout, neural networks can be robust against overfitting, especially important in medical contexts.

**Limitations:**
1. **Computation Cost**: As we've seen in the training time comparison, neural networks typically take longer to train compared to simpler models.
2. **Interpretability**: Neural networks are often considered "black boxes" - it's difficult to explain precisely why they make specific predictions, which can be problematic in medical applications.
3. **Data Requirements**: They typically perform best with large amounts of data, which may not always be available in medical contexts.

## Analysis and Reflection

### Model Effectiveness Comparison

In this analysis, we've compared three different models for breast cancer classification:

1. **Neural Network**: A multi-layer feedforward model with dropout for regularization
2. **Logistic Regression**: A linear model that's often effective for binary classification
3. **Decision Tree**: A non-linear model that creates decision boundaries based on feature thresholds

### Key Findings

- **Performance Metrics**: All three models achieved high accuracy, precision, and recall on the test set, suggesting that the breast cancer classification task is well-suited for machine learning approaches.

- **Complexity vs. Performance**: The neural network, despite being more complex, didn't necessarily outperform the simpler models by a significant margin. This is an important observation as it demonstrates that simpler models can sometimes be equally effective for certain problems.

- **ROC Curves**: ROC curves and AUC scores were not computed in this notebook. They would complement the threshold-based metrics above by showing how well each model separates the two classes across all decision thresholds.
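
ROC curves and AUC scores can be obtained from predicted probabilities. The following is a minimal sketch for the two scikit-learn models, assuming the same split and scaling used earlier; it retrains the models so that the block runs on its own. For the Keras model, the probabilities returned by `nn_model.predict(X_test_scaled)` could be flattened and passed to `roc_auc_score` in the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Recreate the split and scaling used earlier in the notebook
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    # Probability of the positive class (label 1, benign)
    scores = model.predict_proba(X_test_scaled)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, scores)
    # fpr and tpr are the points to plot for the ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    print(f"{name}: AUC = {auc_scores[name]:.4f}")
```

An unpruned decision tree outputs probabilities that are mostly 0 or 1, so its ROC curve has very few points and its AUC is a coarser summary than for the other models.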

### Model Suitability for This Problem

- **Neural Network**: Offers flexibility and can capture complex patterns, but may be overkill for this particular dataset. The added complexity comes with longer training times and more hyperparameters to tune.

- **Logistic Regression**: Provides excellent performance with much less complexity. It's easier to interpret and faster to train, making it a strong candidate for this problem.

- **Decision Tree**: Also performs well and offers interpretability through its decision rules. However, an unpruned tree like the one used here (no depth limit) is prone to overfitting and may generalize less reliably to new data.

### Conclusion

For the breast cancer classification task, the simpler models (especially Logistic Regression) appear to offer a good balance between performance and complexity. The neural network doesn't provide enough additional benefit to justify its complexity in this case.

This highlights an important principle in machine learning: always start with simpler models and only move to more complex ones if there's a clear benefit. The "right" model depends not only on performance metrics but also on factors like interpretability, training time, and ease of deployment.