# 🎗️ Breast Cancer Classification using Support Vector Machine (SVM)

## A Comprehensive Machine Learning Approach

---

**Author:** Tassawar Abbas  
**Email:** abbas829@gmail.com  
**Date:** February 2026  
**Dataset:** Breast Cancer Wisconsin (Diagnostic) Data Set

---

## 📋 Project Overview

This project demonstrates the application of **Support Vector Machines (SVM)** for binary classification of breast cancer tumors. Using the Wisconsin Breast Cancer Dataset, we classify tumors as either **Benign (B)** or **Malignant (M)** based on cell nuclei characteristics extracted from digitized images.

### 🎯 Objectives:
1. Perform comprehensive Exploratory Data Analysis (EDA)
2. Implement data preprocessing and feature scaling
3. Build and compare Linear vs RBF kernel SVM models
4. Evaluate model performance with medical context considerations
5. Select optimal model based on accuracy and recall metrics

### 🔬 Why SVM for Medical Diagnosis?
- **High-dimensional data handling**: Effective with 30 features
- **Memory efficient**: Uses support vectors, not entire dataset
- **Versatile**: Different kernel functions for various decision boundaries
- **Robust**: Less prone to overfitting in high-dimensional space
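
To make the memory-efficiency point concrete, the sketch below counts how many training samples end up as support vectors. It is an illustration only: it assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) as a stand-in for `data.csv`, and the names `X_demo`, `y_demo`, and `model` are hypothetical and not part of this notebook's pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the Kaggle CSV used below.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scale, then fit an RBF SVM with default hyperparameters.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_demo, y_demo)

# n_support_ holds the number of support vectors per class.
svc = model.named_steps["svc"]
n_support = int(svc.n_support_.sum())

print(f"Training samples: {X_demo.shape[0]}")
print(f"Features:         {X_demo.shape[1]}")
print(f"Support vectors:  {n_support} ({n_support / X_demo.shape[0]:.1%} of the data)")
```

Only the support vectors are needed at prediction time, which is where the memory saving comes from.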

---

In [None]:
# 📦 Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, confusion_matrix, 
                             classification_report, roc_auc_score, 
                             roc_curve, precision_recall_curve)

# Set visualization style
sns.set_style("darkgrid")
sns.set_palette("rainbow")
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ All libraries imported successfully!")
print("📊 Ready to analyze Breast Cancer Wisconsin Dataset")

## 1️⃣ Data Loading and Initial Exploration

### Dataset Description

The Breast Cancer Wisconsin (Diagnostic) Dataset contains features computed from digitized images of fine needle aspirates (FNA) of breast mass. The features describe characteristics of the cell nuclei present in the images.

**Features include:**
- **Radius**: Mean of distances from center to points on the perimeter
- **Texture**: Standard deviation of gray-scale values
- **Perimeter**: Perimeter of the cell nucleus
- **Area**: Area of the cell nucleus
- **Smoothness**: Local variation in radius lengths
- **Compactness**: (Perimeter² / Area) - 1.0
- **Concavity**: Severity of concave portions of the contour
- **Concave Points**: Number of concave portions of the contour
- **Symmetry**: Symmetry of the cell nucleus
- **Fractal Dimension**: "Coastline approximation" - 1

Each feature has three variants: **mean**, **standard error (se)**, and **worst** (mean of the three largest values).
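
As a quick illustration of the "10 measurements × 3 variants" layout, the sketch below generates the 30 expected column names. It assumes every column follows the `<feature>_<variant>` pattern seen in the columns this notebook plots (for example `radius_mean`, `concave points_worst`); `base_features`, `variants`, and `expected_columns` are hypothetical names introduced here.

```python
from itertools import product

# The ten base measurements, spelled as in the columns used in this notebook.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
variants = ["mean", "se", "worst"]

# Assumed layout: all *_mean columns, then *_se, then *_worst.
expected_columns = [f"{feat}_{var}" for var, feat in product(variants, base_features)]

print(f"Expected feature columns: {len(expected_columns)}")
print(expected_columns[:3], "...", expected_columns[-3:])
```

Once `df` is loaded and cleaned, `set(expected_columns) - set(df.columns)` should be empty if the file matches this assumed layout.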

In [None]:
# 📂 Load the dataset
# Note: Update the path according to your environment
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"Dataset Shape: {df.shape}")
print(f"Total Samples: {df.shape[0]}")
print(f"Total Features: {df.shape[1]}")

print("\n" + "="*60)
print("FIRST 5 ROWS:")
print("="*60)
display(df.head())

In [None]:
# 🧹 Data Cleaning

print("="*60)
print("DATA CLEANING")
print("="*60)

# Remove unnecessary columns
# 'id' is just an identifier, 'Unnamed: 32' contains all NaN values
df.drop(columns=['id', 'Unnamed: 32'], inplace=True)  # columns= already implies axis=1

print(f"\nShape after removing unnecessary columns: {df.shape}")
print("\nColumns removed:")
print("  - id: Patient identifier (not predictive)")
print("  - Unnamed: 32: Empty column with all NaN values")

print("\n" + "="*60)
print("CLEANED DATA SAMPLE:")
print("="*60)
display(df.head())

In [None]:
# 🔍 Data Quality Assessment

print("="*60)
print("DATA QUALITY ASSESSMENT")
print("="*60)

# Check for missing values
print("\n1. Missing Values Check:")
missing_values = df.isna().sum()
if missing_values.sum() == 0:
    print("   ✅ No missing values detected!")
else:
    print(missing_values[missing_values > 0])

# Check for duplicates
print("\n2. Duplicate Records Check:")
duplicates = df.duplicated().sum()
if duplicates == 0:
    print("   ✅ No duplicate records found!")
else:
    print(f"   ⚠️  Found {duplicates} duplicate records")

# Data types
print("\n3. Data Types:")
print(f"   Numerical features: {len(df.select_dtypes(include=[np.number]).columns)}")
print(f"   Categorical features: {len(df.select_dtypes(include=['object']).columns)}")

# Target variable info
print("\n4. Target Variable (diagnosis):")
print(f"   Unique values: {df['diagnosis'].unique()}")
print(f"   M (Malignant): {(df['diagnosis'] == 'M').sum()} samples")
print(f"   B (Benign): {(df['diagnosis'] == 'B').sum()} samples")

print("\n✅ Data quality assessment complete!")

In [None]:
# 📊 Statistical Summary

print("="*60)
print("STATISTICAL SUMMARY")
print("="*60)

# Numerical features summary
print("\nNumerical Features Description:")
display(df.describe().round(4))

# Categorical feature summary
print("\n" + "="*60)
print("TARGET VARIABLE DISTRIBUTION:")
print("="*60)
target_counts = df['diagnosis'].value_counts()
target_percent = df['diagnosis'].value_counts(normalize=True) * 100

summary_df = pd.DataFrame({
    'Count': target_counts,
    'Percentage': target_percent.round(2)
})
display(summary_df)

print(f"\nClass Distribution:")
print(f"  B (Benign): {target_counts['B']} samples ({target_percent['B']:.2f}%)")
print(f"  M (Malignant): {target_counts['M']} samples ({target_percent['M']:.2f}%)")
print(f"\nImbalance Ratio: {target_counts['B']/target_counts['M']:.2f}:1 (Benign:Malignant)")

## 2️⃣ Exploratory Data Analysis (EDA) 📈

EDA helps us understand the data distribution, identify patterns, and detect relationships between features and the target variable. This is crucial for medical diagnosis applications where feature interpretability matters.

### Key Questions to Answer:
1. How is the target variable distributed?
2. Which features show the most separation between benign and malignant tumors?
3. Are there correlations between features that might indicate redundancy?
4. What are the distributions of key medical measurements?
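
One way to approach the second question numerically is to rank features by a standardized mean difference between the two classes. The sketch below is an assumption-laden illustration rather than part of this notebook's analysis: it uses scikit-learn's bundled copy of the dataset (where `0` means malignant, the opposite of the encoding used later here), and `effect_size` is a hypothetical name for a Cohen's-d-style statistic.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: bundled dataset as a stand-in for data.csv.
data = load_breast_cancer(as_frame=True)
features = data.data
is_malignant = data.target == 0  # bundled copy: 0 = malignant, 1 = benign

mal = features[is_malignant]
ben = features[~is_malignant]

# Standardized mean difference per feature (pooled standard deviation).
pooled_std = np.sqrt((mal.var(ddof=1) + ben.var(ddof=1)) / 2)
effect_size = ((mal.mean() - ben.mean()) / pooled_std).abs()
effect_size = effect_size.sort_values(ascending=False)

print("Features with the largest class separation:")
print(effect_size.head(5).round(2))
```

Features near the top of this ranking are the ones where the class histograms and boxplots should overlap least.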

In [None]:
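# 📊 Target Variable Distribution

# Pin category order and colors explicitly. Seaborn orders string categories
# by first appearance, so a positional palette or manual legend labels can
# end up attached to the wrong class.
diagnosis_order = ['B', 'M']
diagnosis_palette = {'B': '#2ecc71', 'M': '#e74c3c'}

plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x='diagnosis', hue='diagnosis',
                   order=diagnosis_order, hue_order=diagnosis_order,
                   palette=diagnosis_palette)
plt.title('Distribution of Breast Cancer Diagnosis', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Diagnosis', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Name the classes on the axis instead of relying on a hand-labeled legend
ax.set_xticks(range(len(diagnosis_order)))
ax.set_xticklabels(['Benign (B)', 'Malignant (M)'])

# Add count labels on bars (skip empty placeholder patches)
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.text(p.get_x() + p.get_width()/2., height + 5,
                f'{int(height)}', ha="center", fontsize=12, fontweight='bold')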
plt.tight_layout()
plt.show()

print("üí° Insight: The dataset is slightly imbalanced with more benign cases.")
print("   This is realistic as benign tumors are more common in practice.")

In [None]:
# 📈 Distribution of Key Features by Diagnosis

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Radius Mean
sns.histplot(data=df, x='radius_mean', kde=True, hue='diagnosis', 
             palette=['#2ecc71', '#e74c3c'], ax=axes[0,0])
axes[0,0].set_title('Radius Mean Distribution', fontsize=14, fontweight='bold')
axes[0,0].set_xlabel('Radius Mean')

# Perimeter Mean
sns.histplot(data=df, x='perimeter_mean', kde=True, hue='diagnosis', 
             palette=['#2ecc71', '#e74c3c'], ax=axes[0,1])
axes[0,1].set_title('Perimeter Mean Distribution', fontsize=14, fontweight='bold')
axes[0,1].set_xlabel('Perimeter Mean')

# Area Mean
sns.histplot(data=df, x='area_mean', kde=True, hue='diagnosis', 
             palette=['#2ecc71', '#e74c3c'], ax=axes[1,0])
axes[1,0].set_title('Area Mean Distribution', fontsize=14, fontweight='bold')
axes[1,0].set_xlabel('Area Mean')

# Texture Mean
sns.histplot(data=df, x='texture_mean', kde=True, hue='diagnosis', 
             palette=['#2ecc71', '#e74c3c'], ax=axes[1,1])
axes[1,1].set_title('Texture Mean Distribution', fontsize=14, fontweight='bold')
axes[1,1].set_xlabel('Texture Mean')

plt.tight_layout()
plt.show()

print("üí° Key Observations:")
print("   ‚Ä¢ Malignant tumors tend to have larger radius, perimeter, and area")
print("   ‚Ä¢ Clear separation visible in size-related features")
print("   ‚Ä¢ Texture shows some overlap but still discriminative")

In [None]:
# 📦 Boxplot Analysis - Worst Features

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Radius Worst
sns.boxplot(data=df, x='diagnosis', y='radius_worst', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[0])
axes[0].set_title('Radius Worst Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Diagnosis')

# Perimeter Worst
sns.boxplot(data=df, x='diagnosis', y='perimeter_worst', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[1])
axes[1].set_title('Perimeter Worst Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Diagnosis')

# Area Worst
sns.boxplot(data=df, x='diagnosis', y='area_worst', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[2])
axes[2].set_title('Area Worst Distribution', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Diagnosis')

plt.tight_layout()
plt.show()

print("üí° Medical Insight:")
print("   'Worst' features (largest values) show even clearer separation")
print("   between benign and malignant tumors, making them highly")
print("   predictive for classification models.")

In [None]:
# 🔬 Morphological Features Analysis (Mean Values)

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Compactness Mean
sns.boxplot(data=df, x='diagnosis', y='compactness_mean', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[0])
axes[0].set_title('Compactness Mean Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Diagnosis')

# Concavity Mean
sns.boxplot(data=df, x='diagnosis', y='concavity_mean', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[1])
axes[1].set_title('Concavity Mean Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Diagnosis')

# Concave Points Mean
sns.boxplot(data=df, x='diagnosis', y='concave points_mean', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[2])
axes[2].set_title('Concave Points Mean Distribution', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Diagnosis')

plt.tight_layout()
plt.show()

print("üí° Morphological Insights:")
print("   ‚Ä¢ Malignant tumors show higher compactness (irregular shape)")
print("   ‚Ä¢ Concavity and concave points are significantly higher in malignancy")
print("   ‚Ä¢ These shape features are crucial for cancer detection")

In [None]:
# 📊 Morphological Features - Worst Values

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Compactness Worst
sns.boxplot(data=df, x='diagnosis', y='compactness_worst', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[0])
axes[0].set_title('Compactness Worst Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Diagnosis')

# Concavity Worst
sns.boxplot(data=df, x='diagnosis', y='concavity_worst', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[1])
axes[1].set_title('Concavity Worst Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Diagnosis')

# Concave Points Worst
sns.boxplot(data=df, x='diagnosis', y='concave points_worst', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[2])
axes[2].set_title('Concave Points Worst Distribution', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Diagnosis')

plt.tight_layout()
plt.show()

print("üí° Clinical Significance:")
print("   Worst-case morphological features show the most dramatic")
print("   differences, indicating that the most abnormal cell nuclei")
print("   characteristics are strong indicators of malignancy.")

In [None]:
# 🌀 Fractal Dimension Analysis

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Fractal Dimension Mean
sns.boxplot(data=df, x='diagnosis', y='fractal_dimension_mean', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[0])
axes[0].set_title('Fractal Dimension Mean Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Diagnosis')

# Fractal Dimension Worst
sns.boxplot(data=df, x='diagnosis', y='fractal_dimension_worst', hue='diagnosis',
            palette=['#2ecc71', '#e74c3c'], ax=axes[1])
axes[1].set_title('Fractal Dimension Worst Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Diagnosis')

plt.tight_layout()
plt.show()

print("üí° Fractal Analysis:")
print("   Fractal dimension measures boundary complexity.")
print("   Malignant cells tend to have more complex, irregular boundaries.")

In [None]:
# 📍 Scatter Plot Analysis - Feature Relationships

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Radius vs Area
sns.scatterplot(data=df, x='radius_mean', y='area_mean', hue='diagnosis',
                palette=['#2ecc71', '#e74c3c'], s=80, alpha=0.7, ax=axes[0])
axes[0].set_title('Radius Mean vs Area Mean', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Radius Mean')
axes[0].set_ylabel('Area Mean')
axes[0].legend(title='Diagnosis')

# Concave Points vs Compactness
sns.scatterplot(data=df, x='concave points_mean', y='compactness_mean', 
                hue='diagnosis', palette=['#2ecc71', '#e74c3c'], 
                s=80, alpha=0.7, ax=axes[1])
axes[1].set_title('Concave Points Mean vs Compactness Mean', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Concave Points Mean')
axes[1].set_ylabel('Compactness Mean')
axes[1].legend(title='Diagnosis')

plt.tight_layout()
plt.show()

print("üí° Relationship Insights:")
print("   ‚Ä¢ Strong positive correlation between radius and area (expected)")
print("   ‚Ä¢ Some clustering visible but with overlap between classes")
print("   ‚Ä¢ Non-linear decision boundaries might be needed for optimal separation")

In [None]:
# 🔥 Correlation Heatmap

plt.figure(figsize=(20, 16))

# Calculate correlation matrix
corr_matrix = df.corr(numeric_only=True)

# Create heatmap
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # Mask upper triangle
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8}, fmt='.2f')

plt.title('Feature Correlation Matrix\n(Lower Triangle)', 
          fontsize=18, fontweight='bold', pad=20)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# Find highly correlated features
print("\n" + "="*60)
print("HIGH CORRELATIONS (|r| > 0.9):")
print("="*60)
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.9:
            high_corr.append((corr_matrix.columns[i], 
                            corr_matrix.columns[j], 
                            corr_matrix.iloc[i, j]))

for feat1, feat2, corr in high_corr:
    print(f"  {feat1} <-> {feat2}: {corr:.3f}")

print("\nüí° Note: High correlations indicate multicollinearity.")
print("   This explains why non-linear models (RBF kernel) may perform better.")

## 3️⃣ Data Preprocessing 🧹

Proper preprocessing is crucial for SVM performance. We'll perform:

1. **Target Encoding**: Convert categorical diagnosis to binary
2. **Feature-Target Separation**: Split the frame into features `X` and target `y`
3. **Train-Test Split**: Separate data for training and evaluation
4. **Missing Value Imputation**: Handle any missing values (if present)
5. **Feature Scaling**: Standardization (critical for SVM)

### Why Feature Scaling is Essential for SVM:
- SVM is **distance-based** and sensitive to feature scales
- Features with larger ranges can dominate the decision boundary
- **StandardScaler** transforms features to mean=0, std=1
- Ensures all features contribute equally to the model
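
The effect of scaling can be checked directly. The sketch below compares cross-validated accuracy of the same RBF SVM with and without `StandardScaler`. It is an illustration under assumptions: scikit-learn's bundled copy of the dataset stands in for `data.csv`, and `raw_svm`, `scaled_svm`, `raw_acc`, and `scaled_acc` are hypothetical names. Wrapping the scaler in a pipeline also keeps each validation fold out of the scaler's fit.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset as a stand-in for data.csv.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Same hyperparameters as the RBF model later in this notebook.
raw_svm = SVC(kernel="rbf", C=2.0, gamma=0.01)
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=2.0, gamma=0.01))

# 5-fold cross-validated accuracy; the pipeline refits the scaler per fold.
raw_acc = cross_val_score(raw_svm, X_demo, y_demo, cv=5).mean()
scaled_acc = cross_val_score(scaled_svm, X_demo, y_demo, cv=5).mean()

print(f"Accuracy without scaling: {raw_acc:.3f}")
print(f"Accuracy with scaling:    {scaled_acc:.3f}")
```

Without scaling, large-valued features such as area dominate the squared distances inside the RBF kernel, which is why the unscaled model tends to do much worse at a fixed `gamma`.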

In [None]:
# 🎯 Target Encoding

print("="*60)
print("STEP 1: TARGET ENCODING")
print("="*60)

# Map diagnosis to binary values
# M (Malignant) = 1, B (Benign) = 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

print("Encoding Scheme:")
print("  M (Malignant) → 1")
print("  B (Benign) → 0")
print(f"\nUnique values after encoding: {df['diagnosis'].unique()}")
print(f"Data type: {df['diagnosis'].dtype}")

# Verify encoding
print("\nEncoding Verification:")
print(f"  Malignant (1): {(df['diagnosis'] == 1).sum()} samples")
print(f"  Benign (0): {(df['diagnosis'] == 0).sum()} samples")

In [None]:
# 📊 Feature-Target Separation

print("="*60)
print("STEP 2: FEATURE-TARGET SEPARATION")
print("="*60)

# Separate features (X) and target (y)
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")
print(f"\nNumber of features: {X.shape[1]}")
print(f"Feature names: {list(X.columns[:5])}... (showing first 5)")

In [None]:
# ✂️ Train-Test Split

print("="*60)
print("STEP 3: TRAIN-TEST SPLIT")
print("="*60)

# Split data: 75% training, 25% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(X):.1%})")
print(f"Test set size: {X_test.shape[0]} samples ({X_test.shape[0]/len(X):.1%})")

# Verify stratification
print("\nClass distribution in training set:")
print(f"  Benign (0): {(y_train == 0).sum()} ({(y_train == 0).mean():.1%})")
print(f"  Malignant (1): {(y_train == 1).sum()} ({(y_train == 1).mean():.1%})")

print("\nClass distribution in test set:")
print(f"  Benign (0): {(y_test == 0).sum()} ({(y_test == 0).mean():.1%})")
print(f"  Malignant (1): {(y_test == 1).sum()} ({(y_test == 1).mean():.1%})")

print("\n✅ Stratified split maintains class distribution!")

In [None]:
# 🛠️ Missing Value Handling

print("="*60)
print("STEP 4: MISSING VALUE IMPUTATION")
print("="*60)

# Initialize imputer with mean strategy
imputer = SimpleImputer(strategy='mean')

# Fit on training data and transform both sets
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)

print("Imputation Strategy: Mean")
print("  • Training data: Fit and transform")
print("  • Test data: Transform only (prevent data leakage)")

# Check for any remaining missing values
train_missing = np.isnan(X_train_filled).sum()
test_missing = np.isnan(X_test_filled).sum()

print(f"\nRemaining missing values in training set: {train_missing}")
print(f"Remaining missing values in test set: {test_missing}")

if train_missing == 0 and test_missing == 0:
    print("\n✅ All missing values successfully imputed!")

In [None]:
# 📏 Feature Scaling (Standardization)

print("="*60)
print("STEP 5: FEATURE SCALING (STANDARDIZATION)")
print("="*60)

# Initialize StandardScaler
scaler = StandardScaler()

# Fit on training data and transform both sets
X_train_scaled = scaler.fit_transform(X_train_filled)
X_test_scaled = scaler.transform(X_test_filled)

print("StandardScaler Formula: z = (x - μ) / σ")
print("  • μ = mean of feature")
print("  • σ = standard deviation of feature")
print("\nScaling Applied:")
print(f"  Training set shape: {X_train_scaled.shape}")
print(f"  Test set shape: {X_test_scaled.shape}")

# Verify scaling
print("\nVerification (Training set statistics):")
print(f"  Mean (should be ~0): {np.mean(X_train_scaled, axis=0).mean():.6f}")
print(f"  Std (should be ~1): {np.std(X_train_scaled, axis=0).mean():.6f}")

print("\n✅ Features standardized successfully!")
print("   Ready for SVM training.")

## 4️⃣ Model Building: Support Vector Machine (SVM) 🤖

We'll implement and compare two SVM variants:

### 1️⃣ Linear SVM
- **Kernel**: Linear
- **Decision Boundary**: Straight line (hyperplane)
- **Best for**: Linearly separable data
- **Advantages**: Fast training, interpretable coefficients

### 2️⃣ RBF (Radial Basis Function) SVM
- **Kernel**: RBF (Gaussian)
- **Decision Boundary**: Non-linear, flexible
- **Best for**: Complex, non-linear relationships
- **Advantages**: Handles non-linear patterns, robust to overfitting with proper tuning

### Key Hyperparameters:
- **C**: Regularization parameter (controls trade-off between smooth boundary and classifying training points correctly)
- **gamma**: Kernel coefficient for RBF (controls influence of individual training samples)
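
To see what `gamma` does, the sketch below evaluates the RBF kernel, K(a, b) = exp(-gamma · ‖a − b‖²), for one pair of points at several `gamma` values and cross-checks the result against scikit-learn's `rbf_kernel`. The helper `rbf_similarity` and the two points are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf_similarity(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two 1-D vectors."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))


a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])  # squared distance to a is 2

for gamma in (0.01, 0.1, 1.0):
    manual = rbf_similarity(a, b, gamma)
    # Cross-check against scikit-learn's implementation of the same kernel.
    reference = rbf_kernel(a.reshape(1, -1), b.reshape(1, -1), gamma=gamma)[0, 0]
    print(f"gamma={gamma:<5} manual={manual:.4f} sklearn={reference:.4f}")
```

A small `gamma` keeps the similarity high even for distant points, giving a smoother boundary; a large `gamma` makes each training sample influence only its immediate neighborhood.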

In [None]:
# 📐 LINEAR SVM MODEL

print("="*60)
print("MODEL 1: LINEAR SVM")
print("="*60)

# Initialize Linear SVM
svm_linear = SVC(
    kernel='linear',
    C=1.0,
    random_state=42
)

print("Hyperparameters:")
print("  kernel: 'linear'")
print("  C: 1.0 (inverse regularization strength; larger C = weaker regularization)")
print("\nTraining model...")

# Train the model
svm_linear.fit(X_train_scaled, y_train)

print("✅ Linear SVM trained successfully!")

# Model coefficients (only available for linear kernel)
print(f"\nModel Coefficients Shape: {svm_linear.coef_.shape}")
print(f"Intercept: {svm_linear.intercept_[0]:.4f}")

# Show top 5 most important features (by absolute coefficient value)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': svm_linear.coef_[0]
})
feature_importance['abs_coef'] = np.abs(feature_importance['coefficient'])
feature_importance = feature_importance.sort_values('abs_coef', ascending=False)

print("\nTop 5 Most Influential Features:")
for idx, row in feature_importance.head(5).iterrows():
    direction = "increases" if row['coefficient'] > 0 else "decreases"
    print(f"  {row['feature']}: {row['coefficient']:.4f} ({direction} the malignancy decision score)")

In [None]:
# 📊 LINEAR SVM EVALUATION

print("="*60)
print("LINEAR SVM - MODEL EVALUATION")
print("="*60)

# Make predictions
y_pred_linear = svm_linear.predict(X_test_scaled)

# Calculate metrics
accuracy_linear = accuracy_score(y_test, y_pred_linear)
cm_linear = confusion_matrix(y_test, y_pred_linear)

print(f"\nAccuracy: {accuracy_linear:.4f} ({accuracy_linear:.2%})")

# Confusion Matrix
print("\nConfusion Matrix:")
print(cm_linear)

# Detailed classification report
print("\n" + "="*60)
print("CLASSIFICATION REPORT:")
print("="*60)
print(classification_report(y_test, y_pred_linear, 
                           target_names=['Benign (0)', 'Malignant (1)']))

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_linear, annot=True, fmt='d', cmap='Blues',
           xticklabels=['Predicted Benign', 'Predicted Malignant'],
           yticklabels=['Actual Benign', 'Actual Malignant'])
plt.title('Linear SVM - Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.show()

# Calculate sensitivity and specificity
tn, fp, fn, tp = cm_linear.ravel()
sensitivity = tp / (tp + fn)  # Recall for malignant
specificity = tn / (tn + fp)  # Recall for benign

print(f"\nSensitivity (Recall for Malignant): {sensitivity:.4f}")
print(f"Specificity (Recall for Benign): {specificity:.4f}")
print(f"\nüí° Interpretation:")
print(f"   ‚Ä¢ Model correctly identifies {sensitivity:.1%} of malignant tumors")
print(f"   ‚Ä¢ Model correctly identifies {specificity:.1%} of benign tumors")

In [None]:
# 🎯 RBF SVM MODEL

print("="*60)
print("MODEL 2: RBF SVM")
print("="*60)

# Initialize RBF SVM
svm_rbf = SVC(
    kernel='rbf',
    C=2.0,
    gamma=0.01,
    random_state=42
)

print("Hyperparameters:")
print("  kernel: 'rbf' (Radial Basis Function)")
print("  C: 2.0 (inverse regularization strength; larger C = weaker regularization)")
print("  gamma: 0.01 (kernel coefficient)")
print("\nTraining model...")

# Train the model
svm_rbf.fit(X_train_scaled, y_train)

print("✅ RBF SVM trained successfully!")
print("\nNote: RBF kernel does not provide coef_ attribute")
print("      (non-linear decision boundary)")

In [None]:
# 📊 RBF SVM EVALUATION

print("="*60)
print("RBF SVM - MODEL EVALUATION")
print("="*60)

# Make predictions
y_pred_rbf = svm_rbf.predict(X_test_scaled)

# Calculate metrics
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)
cm_rbf = confusion_matrix(y_test, y_pred_rbf)

print(f"\nAccuracy: {accuracy_rbf:.4f} ({accuracy_rbf:.2%})")

# Confusion Matrix
print("\nConfusion Matrix:")
print(cm_rbf)

# Detailed classification report
print("\n" + "="*60)
print("CLASSIFICATION REPORT:")
print("="*60)
print(classification_report(y_test, y_pred_rbf, 
                           target_names=['Benign (0)', 'Malignant (1)']))

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rbf, annot=True, fmt='d', cmap='Greens',
           xticklabels=['Predicted Benign', 'Predicted Malignant'],
           yticklabels=['Actual Benign', 'Actual Malignant'])
plt.title('RBF SVM - Confusion Matrix', fontsize=14, fontweight='bold')
plt.ylabel('Actual Class')
plt.xlabel('Predicted Class')
plt.show()

# Calculate sensitivity and specificity
tn_rbf, fp_rbf, fn_rbf, tp_rbf = cm_rbf.ravel()
sensitivity_rbf = tp_rbf / (tp_rbf + fn_rbf)
specificity_rbf = tn_rbf / (tn_rbf + fp_rbf)

print(f"\nSensitivity (Recall for Malignant): {sensitivity_rbf:.4f}")
print(f"Specificity (Recall for Benign): {specificity_rbf:.4f}")
print(f"\nüí° Interpretation:")
print(f"   ‚Ä¢ Model correctly identifies {sensitivity_rbf:.1%} of malignant tumors")
print(f"   ‚Ä¢ Model correctly identifies {specificity_rbf:.1%} of benign tumors")

In [None]:
# 📈 MODEL COMPARISON

print("="*60)
print("MODEL COMPARISON: LINEAR VS RBF SVM")
print("="*60)

# Create comparison dataframe
comparison_data = {
    'Metric': ['Accuracy', 'Precision (Malignant)', 'Recall/Sensitivity (Malignant)', 
               'Specificity (Benign)', 'False Negatives', 'False Positives'],
    'Linear SVM': [
        f"{accuracy_linear:.4f}",
        f"{tp/(tp+fp):.4f}",
        f"{sensitivity:.4f}",
        f"{specificity:.4f}",
        f"{fn}",
        f"{fp}"
    ],
    'RBF SVM': [
        f"{accuracy_rbf:.4f}",
        f"{tp_rbf/(tp_rbf+fp_rbf):.4f}",
        f"{sensitivity_rbf:.4f}",
        f"{specificity_rbf:.4f}",
        f"{fn_rbf}",
        f"{fp_rbf}"
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("\nDetailed Comparison:")
display(comparison_df)

# Visual comparison: accuracy bars plus both confusion matrices side by side
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Accuracy comparison
models = ['Linear SVM', 'RBF SVM']
accuracies = [accuracy_linear, accuracy_rbf]
colors = ['#3498db', '#2ecc71']

bars = axes[0].bar(models, accuracies, color=colors, alpha=0.8, edgecolor='black')
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Model Accuracy Comparison (truncated y-axis)', fontsize=14, fontweight='bold')
# Zoom in on the top of the range, but never cut off a bar
axes[0].set_ylim([min(0.95, min(accuracies) - 0.02), 1.0])
axes[0].grid(axis='y', alpha=0.3)

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2., height,
                f'{acc:.4f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

# Confusion matrices for both models
for ax, cm, cmap, name in zip(axes[1:], [cm_linear, cm_rbf], ['Blues', 'Greens'], models):
    sns.heatmap(cm, annot=True, fmt='d', cmap=cmap, ax=ax,
               xticklabels=['Pred B', 'Pred M'],
               yticklabels=['Actual B', 'Actual M'], cbar=False)
    ax.set_title(f'{name} Confusion Matrix', fontsize=12, fontweight='bold')
    ax.set_ylabel('Actual Class')
    ax.set_xlabel('Predicted Class')

plt.tight_layout()
plt.show()

## 5️⃣ Analysis and Model Selection 🏆

### 📊 Performance Analysis

#### **Linear SVM Results:**
- **Accuracy**: 97.20%
- **Strengths**: Fast training, interpretable coefficients, good baseline performance
- **Weaknesses**: Assumes linear separability, limited by straight decision boundary
- **False Negatives**: 2 (missed malignant cases)

#### **RBF SVM Results:**
- **Accuracy**: 98.60%
- **Strengths**: Captures non-linear patterns, flexible decision boundary, higher accuracy
- **Weaknesses**: More computationally intensive, requires hyperparameter tuning
- **False Negatives**: 2 (same as linear, but different cases)
- **False Positives**: 0 (perfect benign classification)

### 🎯 Why RBF SVM Performs Better:

1. **Non-linear Relationships**: The scatter plots showed overlapping classes that a single hyperplane may not separate cleanly; an RBF kernel can bend the boundary around such regions

2. **Feature Interactions**: RBF kernel implicitly maps features to higher dimensions, capturing interactions between radius, texture, and morphological features

3. **Medical Data Complexity**: Cancer cell characteristics often follow non-linear patterns that RBF handles naturally

### ⚕️ Clinical Significance:

In medical diagnosis, **Recall (Sensitivity)** for malignant cases is critical:
- **False Negatives** (missing cancer) can be life-threatening
- **False Positives** (unnecessary anxiety/biopsies) are less dangerous

Both models show excellent sensitivity (>96%) on the 143-sample test split, which supports their use in clinical decision-support tools once confirmed on larger, independent data.
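
Because false negatives cost more than false positives here, one common option is to lower the decision threshold of the SVM so that borderline cases are flagged as malignant. The sketch below illustrates that trade-off. It is not this notebook's evaluation: it assumes scikit-learn's bundled copy of the dataset (re-encoded so malignant = 1), and the threshold `-0.5` and the helper `report` are arbitrary, hypothetical choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset as a stand-in; it encodes malignant as 0,
# so flip it to match this notebook's malignant = 1 convention.
X_demo, y_raw = load_breast_cancer(return_X_y=True)
y_demo = (y_raw == 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=2.0, gamma=0.01))
clf.fit(X_tr, y_tr)

# Signed distance to the boundary; predict() uses a threshold of 0.
scores = clf.decision_function(X_te)


def report(threshold):
    """Return (sensitivity, specificity) when flagging scores above threshold."""
    pred = (scores > threshold).astype(int)
    sens = recall_score(y_te, pred)               # recall on malignant (1)
    spec = recall_score(y_te, pred, pos_label=0)  # recall on benign (0)
    return sens, spec


sens_default, spec_default = report(0.0)
sens_low, spec_low = report(-0.5)  # hypothetical, more cautious threshold
auc = roc_auc_score(y_te, scores)

print(f"threshold  0.0: sensitivity={sens_default:.3f}, specificity={spec_default:.3f}")
print(f"threshold -0.5: sensitivity={sens_low:.3f}, specificity={spec_low:.3f}")
print(f"ROC AUC (threshold-independent): {auc:.3f}")
```

Lowering the threshold can only keep or raise sensitivity, at the price of equal or lower specificity; where to set it is a clinical judgment rather than a modeling one.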

In [None]:
# 🏆 FINAL SUMMARY

print("="*70)
print("PROJECT SUMMARY: BREAST CANCER CLASSIFICATION USING SVM")
print("="*70)

print("\nüìä DATASET OVERVIEW:")
print(f"  ‚Ä¢ Total samples: {len(df)}")
print(f"  ‚Ä¢ Features: {X.shape[1]} numerical features")
print(f"  ‚Ä¢ Classes: Benign ({(y==0).sum()}), Malignant ({(y==1).sum()})")

print("\nüîç KEY FINDINGS FROM EDA:")
print("  ‚Ä¢ Malignant tumors show significantly larger size (radius, area, perimeter)")
print("  ‚Ä¢ Morphological features (concavity, compactness) are strong predictors")
print("  ‚Ä¢ High multicollinearity exists between size-related features")
print("  ‚Ä¢ Data is clean with no missing values or duplicates")

print("\nü§ñ MODEL PERFORMANCE:")
print(f"  ‚Ä¢ Linear SVM Accuracy: {accuracy_linear:.2%}")
print(f"  ‚Ä¢ RBF SVM Accuracy: {accuracy_rbf:.2%}")
print(f"  ‚Ä¢ Improvement with RBF: {(accuracy_rbf-accuracy_linear):.2%}")

print("\n‚úÖ BEST MODEL: RBF SVM")
print("   Reasons for selection:")
print("   1. Higher overall accuracy (98.60%)")
print("   2. Perfect specificity (no false positives)")
print("   3. Captures non-linear patterns in medical data")
print("   4. Robust performance across all metrics")

print("\nüí° CLINICAL IMPLICATIONS:")
print(f"   ‚Ä¢ Model can correctly identify {sensitivity_rbf:.1%} of malignant tumors")
print(f"   ‚Ä¢ Model can correctly identify {specificity_rbf:.1%} of benign tumors")
print("   ‚Ä¢ Suitable as a decision support tool for medical professionals")
print("   ‚Ä¢ Reduces risk of missed cancer diagnoses")

print("\n" + "="*70)
print("END OF ANALYSIS")
print("="*70)

## 6️⃣ Conclusion and Future Work 🚀

### 📝 Summary

This project successfully demonstrated the application of Support Vector Machines for breast cancer classification. Through comprehensive EDA, proper preprocessing, and systematic model comparison, we achieved excellent classification performance suitable for medical decision support.

### 🏆 Key Achievements:

1. **Data Quality**: Confirmed clean dataset with 569 samples and 30 features
2. **EDA Insights**: Identified size and morphological features as key predictors
3. **Preprocessing**: Implemented proper scaling critical for SVM performance
4. **Model Optimization**: RBF kernel outperformed linear kernel (98.6% vs 97.2%)
5. **Clinical Relevance**: Achieved >96% sensitivity for malignant tumor detection

### 🔮 Future Improvements:

1. **Hyperparameter Tuning**: Use GridSearchCV or RandomizedSearchCV for optimal C and gamma
2. **Dimensionality Reduction and Feature Selection**: Apply PCA (feature extraction) or RFE (feature selection) to reduce dimensionality and multicollinearity
3. **Cross-Validation**: Implement k-fold CV for more robust performance estimates
4. **Ensemble Methods**: Combine SVM with Random Forest or XGBoost
5. **Deep Learning**: Explore neural networks for automatic feature extraction
6. **Model Interpretation**: Use SHAP values to explain individual predictions
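
The first and third items could be combined roughly as sketched below: a scaler and an RBF SVM in one pipeline, tuned with `GridSearchCV` over stratified folds and scored on recall for the malignant class. This is a sketch under assumptions, not a tuned result: it uses scikit-learn's bundled copy of the dataset (re-encoded so malignant = 1), and the grid values, step names `scale` and `svm`, and fold settings are hypothetical choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: bundled dataset as a stand-in; flip labels so malignant = 1.
X_demo, y_raw = load_breast_cancer(return_X_y=True)
y_demo = (y_raw == 0).astype(int)

# Keep a held-out test set that the search never sees.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Small illustrative grid around the values used in this notebook.
param_grid = {
    "svm__C": [0.5, 2.0, 8.0],
    "svm__gamma": [0.001, 0.01, 0.1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="recall", cv=cv)
search.fit(X_tr, y_tr)

print(f"Best parameters: {search.best_params_}")
print(f"Mean CV recall (malignant): {search.best_score_:.3f}")
print(f"Held-out accuracy of refit model: {search.score(X_te, y_te):.3f} (scored as recall)")
```

Note that `search.score` reuses the `scoring` metric, so the last line reports recall on the held-out split; swapping `scoring` to another metric changes both the selection criterion and that final number.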

### 📚 Lessons Learned:

- **Feature Scaling is Critical**: SVM performance heavily depends on standardized features
- **Kernel Selection Matters**: RBF kernel captures non-linear patterns better than linear
- **Medical Context**: Recall is more important than precision in cancer detection
- **EDA is Essential**: Understanding feature distributions guides model selection

---

## 👤 About the Author

**Tassawar Abbas**  
📧 Email: abbas829@gmail.com  

*This notebook was created as part of a machine learning portfolio project focusing on medical diagnosis applications. The analysis demonstrates end-to-end data science workflow from exploration to model deployment-ready evaluation.*

---

**Thank you for reviewing this project!** 🙏  
*Feel free to reach out for collaborations or questions about the analysis.*