# Breast Cancer Dataset - Exploratory Data Analysis and Preprocessing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

## 1. Dataset Description & Understanding the Data Matrix

**Dataset Chosen** : Breast Cancer Wisconsin (Original) - contributed by Dr. William H. Wolberg

**Description** :
The [Breast Cancer Wisconsin dataset](https://archive.ics.uci.edu/dataset/15/breast+cancer+wisconsin+original) contains 699 clinical records of breast mass samples. Each observation has 9 cytological features, each scored 1–10, assessed on a fine needle aspirate (FNA) of a breast mass. The class label distinguishes **benign (2)** from **malignant (4)** tumors.

### 1.1 Load Data & Display Dimensions

In [None]:
df = pd.read_csv("breast_cancer_bd.csv")

# Rename columns for convenience
df.columns = [
    "id", "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
    "marginal_adhesion", "single_epithelial_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class"
]

# Replace '?' with NaN in bare_nuclei and convert to numeric
df["bare_nuclei"] = pd.to_numeric(df["bare_nuclei"].replace("?", np.nan))

n_rows, n_cols = df.shape
print(f"Dataset shape: {n_rows} observations × {n_cols} variables")
print(f"Number of observations (rows): {n_rows}")
print(f"Number of variables (columns): {n_cols} (1 id + 9 features + 1 class label)")
print("\nClass distribution:")
print(df["class"].value_counts())
print("  (2 = Benign, 4 = Malignant)")
print("\nFirst 5 rows:")
df.head()
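
The `?` placeholder can also be declared as missing at read time instead of being replaced afterwards. The sketch below is only an illustration: the three-row CSV and its ids are made up, and it keeps just a few of the columns.

```python
import io

import pandas as pd

# Hypothetical excerpt (invented ids and values), same column naming as above
raw = io.StringIO(
    "id,clump_thickness,bare_nuclei,class\n"
    "1,5,1,2\n"
    "2,8,?,4\n"
    "3,8,10,4\n"
)

# na_values tells the parser to treat '?' as missing while reading
toy = pd.read_csv(raw, na_values="?")

print(toy.dtypes)
print("Missing bare_nuclei values:", toy["bare_nuclei"].isna().sum())
```

Because the column now contains NaN, pandas stores it as a float column even though every observed score is an integer.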

### 1.2 Data Types & Feature Overview

In [None]:
print("Data types:\n")
print(df.dtypes)
print("\nAll 9 cytological features take integer values on an ordinal 1–10 scale.")
print("'bare_nuclei' is stored as float64 only because its missing values are NaN.")
print("The 'class' column is binary: 2 (benign) or 4 (malignant).")
print("The 'id' column is a sample identifier and will be excluded from analysis.")

### 1.3 Summary Statistics

In [None]:
# Work with feature columns only (exclude id)
feature_cols = [
    "clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
    "marginal_adhesion", "single_epithelial_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses"
]

feature_df = df[feature_cols]
print("Summary Statistics (9 Cytological Features):\n")
feature_df.describe().round(3)

**Observation**: All features are measured on an integer scale from 1 to 10. Most features are right-skewed: benign cases (class 2) dominate the dataset, which pulls the medians toward low values. `mitoses` is heavily concentrated at 1 with rare high values, suggesting extreme right skew.
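
The skew described here can be quantified with `DataFrame.skew()`. The sketch below uses invented scores, not values from the dataset, purely to show how a column concentrated at 1 compares with a more evenly spread one.

```python
import pandas as pd

# Hypothetical scores chosen for illustration only
toy = pd.DataFrame({
    "mitoses": [1, 1, 1, 1, 1, 1, 1, 2, 3, 10],
    "clump_thickness": [1, 2, 3, 4, 5, 5, 6, 7, 8, 10],
})

# Sample skewness per column: positive means a long right tail
skew = toy.skew()
print(skew.round(2))
```

Running the same call on `feature_df` would give one skewness value per cytological feature.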

---
## 2. Exploratory Data Analysis
### 2.1 Histograms of Each Feature

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(14, 10))
axes = axes.flatten()

for i, col in enumerate(feature_cols):
    axes[i].hist(feature_df[col].dropna(), bins=10, color='steelblue', edgecolor='white', linewidth=0.6)
    axes[i].set_title(col.replace('_', ' ').title(), fontsize=11, fontweight='bold')
    axes[i].set_xlabel('Value (1–10)')
    axes[i].set_ylabel('Frequency')
    axes[i].grid(axis='y', alpha=0.4)

fig.suptitle('Histograms of Breast Cancer Cytological Features', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

**Interpretation**: Most features show a large peak at low values (mostly benign cases) and a smaller secondary peak at high values (mostly malignant cases), giving a roughly bimodal shape. `mitoses` is extremely right-skewed: the vast majority of samples score 1, the lowest level of mitotic activity.

### 2.2 Boxplots by Class

In [None]:
df_plot = df[feature_cols + ["class"]].copy()
df_plot["diagnosis"] = df_plot["class"].map({2: "Benign", 4: "Malignant"})

fig, axes = plt.subplots(3, 3, figsize=(15, 11))
axes = axes.flatten()

palette = {"Benign": "#4C9BE8", "Malignant": "#E85C5C"}

for i, col in enumerate(feature_cols):
    sns.boxplot(data=df_plot, x="diagnosis", y=col, palette=palette, ax=axes[i], width=0.5)
    axes[i].set_title(col.replace('_', ' ').title(), fontsize=11, fontweight='bold')
    axes[i].set_xlabel('')
    axes[i].set_ylabel('Score (1–10)')
    axes[i].grid(axis='y', alpha=0.4)

fig.suptitle('Feature Distributions by Diagnosis (Benign vs Malignant)', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

**Interpretation**: Malignant tumors consistently show markedly higher scores than benign tumors across the features. **Cell Size Uniformity**, **Cell Shape Uniformity**, and **Bare Nuclei** exhibit the clearest separation between classes, which suggests strong discriminative power.
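
One way to put numbers on the separation seen in the boxplots is to compare per-class medians. This is a minimal sketch on invented scores; the column names mirror the notebook, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical scores for eight samples, four per class
toy = pd.DataFrame({
    "diagnosis": ["Benign"] * 4 + ["Malignant"] * 4,
    "cell_size_uniformity": [1, 1, 2, 1, 8, 10, 6, 9],
    "mitoses": [1, 1, 1, 2, 1, 3, 1, 2],
})

# Median score per class, then the malignant-minus-benign gap per feature
medians = toy.groupby("diagnosis").median()
gap = medians.loc["Malignant"] - medians.loc["Benign"]

print(medians)
print(gap.sort_values(ascending=False))
```

A larger gap corresponds to boxes that sit further apart in the plot. Applied to `df_plot`, the same two lines would rank the nine features by class separation.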

### 2.3 Scatter Matrix

In [None]:
# Use a subset of features for readability
subset_cols = ["clump_thickness", "cell_size_uniformity", "cell_shape_uniformity",
               "bare_nuclei", "bland_chromatin", "normal_nucleoli"]

color_map = df_plot["diagnosis"].map({"Benign": "#4C9BE8", "Malignant": "#E85C5C"})

axes_sm = scatter_matrix(
    df_plot[subset_cols].dropna(), figsize=(13, 11),
    diagonal='hist', color=color_map.loc[df_plot[subset_cols].dropna().index],
    alpha=0.5, hist_kwds={'bins': 10, 'edgecolor': 'white'}, marker='.'
)

for ax in axes_sm.flatten():
    ax.xaxis.label.set_rotation(30)
    ax.yaxis.label.set_rotation(0)
    ax.yaxis.label.set_ha('right')

plt.suptitle('Scatter Matrix — Blue: Benign | Red: Malignant', fontsize=13, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

**Interpretation**: The scatter matrix shows benign (blue) and malignant (red) cases forming distinct clusters across most feature pairs. Pairs involving **Cell Size Uniformity** and **Cell Shape Uniformity** show nearly complete separation, which supports the boxplot finding that these are among the most informative features for classification.

### 2.4 Correlation Matrix (Heatmap)

In [None]:
corr_matrix = feature_df.corr()
fig, ax = plt.subplots(figsize=(11, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
sns.heatmap(
    corr_matrix, mask=mask, annot=True, fmt='.3f', cmap='coolwarm',
    center=0, linewidths=0.5, ax=ax, vmin=-1, vmax=1, cbar_kws={'shrink': 0.8}
)
ax.set_title('Correlation Matrix of Breast Cancer Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

**Interpretation**: Strong positive correlations are observed among **Cell Size Uniformity**, **Cell Shape Uniformity**, and **Bare Nuclei**, indicating that these features co-vary closely and may reflect the same underlying biological abnormality. **Mitoses** shows the weakest correlations with the other features, suggesting it captures a more independent aspect of malignancy.

---
## 3. Data Preprocessing
### 3.1 Missing Values

In [None]:
missing = feature_df.isnull().sum()
missing_pct = (feature_df.isnull().mean() * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing (%)': missing_pct
})

total_missing = feature_df.isnull().sum().sum()
print(f"Total missing values in the dataset: {total_missing}\n")
missing_df

In [None]:
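# Impute missing bare_nuclei values with the median
median_bn = feature_df["bare_nuclei"].median()
feature_df = feature_df.copy()
# Assign the result back: fillna(inplace=True) on a selected column is a chained
# assignment that newer pandas versions warn about and may silently not apply
feature_df["bare_nuclei"] = feature_df["bare_nuclei"].fillna(median_bn)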

print(f"Missing values in 'bare_nuclei' imputed with median value: {median_bn}")
print(f"Remaining missing values: {feature_df.isnull().sum().sum()}")

### 3.2 Outlier Detection (Z-Score Method)

In [None]:
z_scores = feature_df.apply(zscore)
threshold = 3.0
outlier_mask = (z_scores.abs() > threshold)
outlier_counts = outlier_mask.sum()

outlier_df = pd.DataFrame({
    'Outlier Count (|z|>3)': outlier_counts,
    'Outlier (%)': (outlier_counts / len(feature_df) * 100).round(2)
})

print(f"Total outlier cells detected (|z| > {threshold}): {outlier_mask.sum().sum()}\n")
outlier_df

**Interpretation**: Since all features are bounded on a 1–10 ordinal scale, extreme outliers are naturally constrained. Any z-score outliers detected are likely genuine malignant cases with extreme cytological abnormality, and are therefore retained in the dataset rather than removed.
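
To see how a bounded 1–10 scale interacts with the |z| > 3 rule, the sketch below compares two invented columns: one concentrated at 1 with a few 10s, and one spread evenly over the scale. The helper `z_scores` is a hypothetical stand-in that uses the population standard deviation, which is also the default of `scipy.stats.zscore`.

```python
import numpy as np


def z_scores(values):
    """Population z-scores (ddof=0), matching the scipy.stats.zscore default."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Hypothetical columns: 95 ones plus 5 tens, and ten copies of 1..10
skewed = np.array([1] * 95 + [10] * 5)
spread = np.tile(np.arange(1, 11), 10)

skewed_flags = int((np.abs(z_scores(skewed)) > 3).sum())
spread_flags = int((np.abs(z_scores(spread)) > 3).sum())

print("Flagged in skewed column:", skewed_flags)
print("Flagged in evenly spread column:", spread_flags)
```

Only the heavily concentrated column produces flags, which is why z-score outliers on this kind of scale tend to be the rare high scores in skewed features rather than measurement errors.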

### 3.3 Standardization (Z-Score Normalization)

In [None]:
scaler = StandardScaler()
feature_scaled = pd.DataFrame(
    scaler.fit_transform(feature_df),
    columns=feature_cols,
    index=feature_df.index  # keep row labels aligned with feature_df
)

print("Summary Statistics After Standardization:\n")
print(f"Mean (should be ~0):\n{feature_scaled.mean().round(6)}\n")
print(f"Std Dev (should be ~1):\n{feature_scaled.std().round(6)}")
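
The standard deviation printed above comes out slightly above 1 rather than exactly 1. A likely reason is that `StandardScaler` scales by the population standard deviation (denominator n), while pandas `.std()` defaults to the sample version (denominator n − 1). The sketch below illustrates this on a small invented frame.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column frame, values chosen arbitrarily
toy = pd.DataFrame({"a": [1, 2, 3, 4, 10], "b": [2, 2, 5, 7, 9]})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

n = len(toy)
pop_std = scaled.std(ddof=0)     # what the scaler normalises to 1
sample_std = scaled.std()        # pandas default, ddof=1
expected = np.sqrt(n / (n - 1))  # ratio between the two definitions

print(pop_std.round(6))
print(sample_std.round(6), "expected:", round(expected, 6))
```

With only five rows the gap is visible; with several hundred rows it shrinks to the third or fourth decimal place.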

---
## 4. Mean Vector & Covariance Matrix
### 4.1 Mean Vector

In [None]:
cols = feature_cols

mean_bfr = feature_df[cols].mean()
mean_aft = feature_scaled[cols].mean()

mean_summary = pd.DataFrame({
    'Before Standardization': mean_bfr.round(4),
    'After Standardization':  mean_aft.round(6)
})

print("Mean Vector:\n")
print(mean_summary.to_string())

### 4.2 Covariance Matrix

In [None]:
cov_matrix_bfr_std = feature_df[cols].cov()
cov_matrix_aft_std = feature_scaled[cols].cov()

print("Covariance Matrix (Before Standardization):\n")
print(cov_matrix_bfr_std.round(4).to_string())

In [None]:
print("Covariance Matrix (After Standardization):\n")
print(cov_matrix_aft_std.round(4).to_string())

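**Observation**: Before standardization, the covariance values reflect the original 1–10 scale. After standardization, the diagonal entries are approximately 1 and the off-diagonal entries approximately match the Pearson correlation coefficients between features. The match is not exact because `StandardScaler` divides by the population standard deviation (denominator n) while `.cov()` uses the sample denominator n − 1, so every entry is scaled by n/(n − 1), about 1.0014 for 699 rows.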
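
The relationship between the standardized covariance matrix and the correlation matrix can be checked numerically. The sketch below uses randomly generated 1–10 scores as a stand-in for the real features and standardizes with the population standard deviation, which is an assumption made to mirror what `StandardScaler` does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 50 rows of integer scores between 1 and 10
toy = pd.DataFrame(rng.integers(1, 11, size=(50, 3)), columns=["f1", "f2", "f3"])

# Standardize with the population std (ddof=0)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

n = len(toy)
cov_after = standardized.cov()  # sample covariance, denominator n - 1
corr_before = toy.corr()        # Pearson correlation of the raw data
factor = n / (n - 1)

print(cov_after.round(4))
print((corr_before * factor).round(4))
```

The two printed matrices agree, so the covariance after standardization is the correlation matrix multiplied by n/(n − 1).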

---
## 5. Eigenvalue Decomposition of the Covariance Matrix
### 5.1 Before Standardization

In [None]:
eigvals_bfr, eigvecs_bfr = np.linalg.eigh(cov_matrix_bfr_std)

idx = np.argsort(eigvals_bfr)[::-1]
eigvals_bfr = eigvals_bfr[idx]
eigvecs_bfr = eigvecs_bfr[:, idx]

eig_df_bfr = pd.DataFrame({
    'PC'                 : [f'PC{i+1}' for i in range(len(eigvals_bfr))],
    'Eigenvalue'         : np.round(eigvals_bfr, 4),
    'Var Explained (%)'  : np.round(eigvals_bfr / eigvals_bfr.sum() * 100, 2),
    'Cumulative Var (%)' : np.round(np.cumsum(eigvals_bfr) / eigvals_bfr.sum() * 100, 2)
})

print("Eigenvalues & Variance Explained (Before Standardization):\n")
print(eig_df_bfr.to_string(index=False))

evec_df_bfr = pd.DataFrame(eigvecs_bfr, index=cols, columns=[f'PC{i+1}' for i in range(len(eigvals_bfr))])

print("\nEigenvectors (Before Standardization):\n")
print(evec_df_bfr.round(4).to_string())

### 5.2 After Standardization

In [None]:
eigvals_aft, eigvecs_aft = np.linalg.eigh(cov_matrix_aft_std)

idx = np.argsort(eigvals_aft)[::-1]
eigvals_aft = eigvals_aft[idx]
eigvecs_aft = eigvecs_aft[:, idx]

eig_df_aft = pd.DataFrame({
    'PC'                 : [f'PC{i+1}' for i in range(len(eigvals_aft))],
    'Eigenvalue'         : np.round(eigvals_aft, 4),
    'Var Explained (%)'  : np.round(eigvals_aft / eigvals_aft.sum() * 100, 2),
    'Cumulative Var (%)' : np.round(np.cumsum(eigvals_aft) / eigvals_aft.sum() * 100, 2)
})

print("Eigenvalues & Variance Explained (After Standardization):\n")
print(eig_df_aft.to_string(index=False))

evec_df_aft = pd.DataFrame(eigvecs_aft, index=cols, columns=[f'PC{i+1}' for i in range(len(eigvals_aft))])

print("\nEigenvectors (After Standardization):\n")
print(evec_df_aft.round(4).to_string())

---
## 6. Interpretation & Summary

### Large vs Small Eigenvalues
A **large eigenvalue** means that the corresponding principal component captures a large fraction of the total variance in the data. Conversely, a **small eigenvalue** indicates that very little variability lies along that direction, so that component is nearly redundant.  
From the decomposition (after standardization), **PC1 alone accounts for ~65%** of the total variance, reflecting the strong intercorrelation among the cytological abnormality features. The first 2 PCs together explain **~75–80%** of all variance. The last few PCs each contribute only a small share (the exact percentages are in the table in Section 5), so those directions are candidates for removal in a dimensionality-reduction step.

### Orthogonality of Eigenvectors
The eigenvectors of a symmetric matrix (like the covariance matrix) can always be chosen to be **mutually orthogonal**, and `np.linalg.eigh` returns them that way: the dot product of any two distinct eigenvectors is zero. Geometrically, this means the new coordinate axes (principal components) are **perpendicular** to each other, and the data projected onto them are **uncorrelated by construction**. This orthogonality is what makes PCA such a clean decomposition: each PC captures its own, non-overlapping share of the variance.
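
Both properties can be verified directly. The sketch below runs on synthetic correlated data rather than the notebook's `cov_matrix_aft_std`, and checks that the eigenvector matrix is orthonormal and that the projected scores have a diagonal covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated data: 200 samples, 4 features
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
C = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(C)

# Orthonormality: V^T V should be the identity matrix
gram = eigvecs.T @ eigvecs

# Uncorrelated scores: covariance of the projected data should be diagonal
scores = (X - X.mean(axis=0)) @ eigvecs
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))

print(np.round(gram, 6))
print("Largest off-diagonal covariance:", np.abs(off_diag).max())
```

Replacing `C` with either covariance matrix from Section 4 would apply the same check to the notebook's own decomposition.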

### Eigenvalues as Variability
Each eigenvalue λᵢ quantifies the variance of the data when projected onto the i-th eigenvector. Summing all eigenvalues gives the **total variance** of the dataset, which equals the trace of the covariance matrix (approximately 9, the number of features, for the standardized data). The ratio λᵢ / Σλ gives the proportion of total variance explained by PC_i.
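
These identities are easy to confirm numerically. The following sketch uses random integer scores as hypothetical data and checks that the eigenvalues sum to the trace, and that the variance of the data projected onto the leading eigenvector equals the leading eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 100 samples of three 1-10 scores
X = rng.integers(1, 11, size=(100, 3)).astype(float)
C = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]  # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Variance of the data projected onto the first eigenvector
projection = (X - X.mean(axis=0)) @ eigvecs[:, 0]
proj_var = np.var(projection, ddof=1)

# Proportion of total variance per component
ratio = eigvals / eigvals.sum()

print("Sum of eigenvalues:", eigvals.sum(), "trace:", np.trace(C))
print("Variance along PC1:", proj_var, "lambda_1:", eigvals[0])
print("Explained variance ratio:", ratio.round(4))
```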

### Eigenvectors as New Directions
Each eigenvector is a **unit vector** in the original 9-dimensional feature space. Its components (loadings) tell us how much each original feature contributes to that principal component. For this dataset, PC1 has loadings of the same sign and roughly similar magnitude on most features, so it acts as an overall "malignancy severity" component. The overall sign of an eigenvector is arbitrary, so these loadings may print as all negative rather than all positive.

### Dimensionality Reduction via PCA
PCA reduces dimensionality by keeping only the top-k principal components (those with the largest eigenvalues). The projection of each observation onto these k directions retains the **maximum possible variance** among all k-dimensional subspaces. Given that just 2 PCs capture ~75–80% of total variance for this dataset, the feature space can be compressed from 9 dimensions to 2 while preserving most of the variance. Because PC1 tracks overall severity, much of the structure that separates benign from malignant tumors is likely retained as well.
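
The projection step itself is not carried out in the notebook, so the following is a minimal sketch of it under stated assumptions: the data are synthetic (five features driven by a single hypothetical "severity" factor plus noise), and `k = 2` is chosen only to mirror the discussion above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: one latent factor drives all five features, plus noise
latent = rng.normal(size=(150, 1))
X = latent @ np.ones((1, 5)) + 0.3 * rng.normal(size=(150, 5))

Xc = X - X.mean(axis=0)  # centre the data
C = np.cov(Xc, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W = eigvecs[:, :k]  # top-k directions, shape (5, k)
Z = Xc @ W          # reduced representation, shape (150, k)
X_hat = Z @ W.T     # reconstruction in the original space

retained = eigvals[:k].sum() / eigvals.sum()
recon_err = ((Xc - X_hat) ** 2).sum() / (Xc ** 2).sum()

print("Reduced shape:", Z.shape)
print("Variance retained:", round(retained, 4))
print("Relative reconstruction error:", round(recon_err, 4))
```

The retained variance and the relative reconstruction error add up to 1, which is the sense in which the top-k components lose as little as possible. Using `feature_scaled` and `eigvecs_aft` in place of the synthetic arrays would give the 2-dimensional representation of the actual dataset.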