# Exploratory data analysis of the Breast Cancer Wisconsin (Diagnostic) dataset

This notebook conducts an Exploratory Data Analysis (EDA) on the Breast Cancer Wisconsin (Diagnostic) dataset. 

The primary goals of this EDA are to:
1. Understand the structure and characteristics of the data.
2. Visualize the distributions of key features.
3. Analyze the relationships and correlations between different variables.
4. Identify potential data quality issues, such as missing values or outliers.
5. Form initial hypotheses that will guide the subsequent data preprocessing and feature engineering stages.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pd.set_option('display.float_format', '{:f}'.format)

# Set some plotting styles for better visuals
sns.set_style('whitegrid')
%matplotlib inline

# Define the path to save figures and create the directory if it is missing
import os
FIGURES_PATH = '../reports/figures/'
os.makedirs(FIGURES_PATH, exist_ok=True)

In [None]:
data_path = "../data/data.csv"
df = pd.read_csv(data_path)

## 1. Initial data exploration

In [None]:
display(df.head())

print("\nDataframe info:")
df.info()

print(f"\nColumn names: {df.columns.tolist()}")

In [None]:
print("\nMissing values:")
print(df.isnull().sum())

Column "Unnamed: 32" has 569 missing values, i.e. it is empty in 100% of the rows, so it can be dropped. Column "id" is only a record identifier and carries no diagnostic information, so it can be dropped too.

Column "diagnosis" is the target variable. I'll map the values to 0 and 1 for convenience.
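
A minimal sketch of that cleanup, run on a tiny stand-in frame rather than the real CSV. The helper name `clean_wdbc` and the toy values are assumptions made for illustration; they are not defined anywhere else in this notebook.

```python
import numpy as np
import pandas as pd


def clean_wdbc(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the identifier and the empty column, then encode the target as 0/1."""
    # errors="ignore" tolerates files that lack one of these columns
    cleaned = frame.drop(columns=["id", "Unnamed: 32"], errors="ignore")
    # Malignant -> 1, Benign -> 0 (the same mapping used later for the correlations)
    cleaned["diagnosis"] = cleaned["diagnosis"].map({"M": 1, "B": 0})
    return cleaned


# Toy rows that only mimic the layout of the real file (values are made up)
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 12.4, 11.8],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})
toy_clean = clean_wdbc(toy)
print(toy_clean)
```

Because `drop` returns a new frame, the original `toy` frame is left untouched, which mirrors how the later cells work on a copy of `df`.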

## 2. Target variable analysis

In [None]:
plt.figure(figsize=(6, 4))
ax = sns.countplot(x='diagnosis', data=df)

# Add counts on top of the bars
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 9), textcoords='offset points')

plt.title('Distribution of diagnosis (M = Malignant, B = Benign)')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.savefig(f'{FIGURES_PATH}diagnosis_distribution.png', dpi=300)
plt.show()

# Proportion of malignant and benign cases
print(df['diagnosis'].value_counts(normalize=True))

The target variable is moderately imbalanced: roughly 63% of the cases are benign and 37% are malignant (357 vs. 212), which is not severe.
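
To put a number on the imbalance, the sketch below rebuilds the label column from the published class counts of this dataset (357 benign, 212 malignant) and computes the class proportions and a majority-to-minority ratio. The variable names are illustrative; on the real data the same two lines can be applied to `df['diagnosis']`.

```python
import pandas as pd

# Class counts of the dataset: 357 benign, 212 malignant (569 rows in total)
labels = pd.Series(["B"] * 357 + ["M"] * 212, name="diagnosis")

proportions = labels.value_counts(normalize=True)
# Majority-to-minority ratio; 1.0 would mean perfectly balanced classes
imbalance_ratio = labels.value_counts().max() / labels.value_counts().min()

print(proportions.round(3))
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```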

## 3. Summary statistics

In [None]:
pd.set_option('display.max_columns', None) # Just to show all columns
display(df.describe().T)

The features differ greatly in scale (for example, `area_mean` ranges into the thousands while `smoothness_mean` stays below 1), so feature scaling is needed. I'll use StandardScaler.
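
As a reminder of what standardization does, here is a small NumPy sketch on made-up values that only imitate the scales of two columns. It applies the z-score formula directly; StandardScaler with its default settings computes the same quantity (column mean and population standard deviation).

```python
import numpy as np

# Made-up values that only imitate the scales of two columns such as
# area_mean (hundreds to thousands) and smoothness_mean (below 1)
X = np.array([
    [1001.0, 0.118],
    [1326.0, 0.085],
    [386.1, 0.142],
    [520.0, 0.097],
])

# z = (x - column mean) / column standard deviation (population std, ddof=0),
# which is the formula StandardScaler applies with its default settings
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_scaled.std(axis=0).round(6))   # 1 for every column
```

After scaling, both columns contribute on a comparable footing, which matters for variance-based methods such as PCA.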

## 4. Correlation analysis

In [None]:
# First, I'll map the diagnosis to a numerical format for correlation calculation
df_corr = df.copy()
df_corr['diagnosis'] = df_corr['diagnosis'].map({'M': 1, 'B': 0})

# Drop the ID and the empty column
df_corr = df_corr.drop(['id', 'Unnamed: 32'], axis=1)

corr_matrix = df_corr.corr()

plt.figure(figsize=(18, 15))
sns.heatmap(corr_matrix, cmap='coolwarm', annot=False) # annot=False because there are too many features
plt.title('Correlation Matrix of Features')
plt.savefig(f'{FIGURES_PATH}correlation_matrix.png', dpi=300)
plt.show()

There's a strong correlation between features involving radius, perimeter and area of tumors, which is expected. Let's see the features with the strongest correlation with 'radius_mean'.

In [None]:
corr_with_radius_mean = corr_matrix['radius_mean']

# Calculate the absolute value of the correlations to measure the strength
abs_corr_with_radius_mean = corr_with_radius_mean.abs()

# Sort the absolute correlations in descending order (from highest to lowest)
most_correlated_features_radius_mean = abs_corr_with_radius_mean.sort_values(ascending=False)

print("Features sorted by their correlation with 'radius_mean':")
display(most_correlated_features_radius_mean)

The features with the strongest correlation with 'radius_mean' are 'perimeter_mean' and 'area_mean'. This is expected because, for a roughly circular shape, the perimeter is proportional to the radius and the area to the square of the radius. It also suggests that these features are redundant and good candidates for feature reduction.
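
A quick numerical check of the circle argument, as a sketch on idealised circles rather than on the dataset: the perimeter is linear in the radius while the area grows with its square, yet both remain very strongly correlated with the radius over a limited range. The radius range used here is arbitrary.

```python
import numpy as np

# Idealised circular "tumors" with radii spread over an arbitrary range
radius = np.linspace(5.0, 30.0, 100)
perimeter = 2 * np.pi * radius   # linear in the radius
area = np.pi * radius ** 2       # quadratic in the radius

corr_perimeter = np.corrcoef(radius, perimeter)[0, 1]
corr_area = np.corrcoef(radius, area)[0, 1]

print(f"corr(radius, perimeter) = {corr_perimeter:.4f}")
print(f"corr(radius, area)      = {corr_area:.4f}")
```

The perimeter correlation is exactly 1 up to rounding, and the area correlation is slightly lower because the relationship is not linear, which is consistent with the near-duplicate behaviour of these three features in the heatmap.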

Let's look at the features with the strongest correlation with 'diagnosis'.

In [None]:
corr_with_diagnosis = corr_matrix['diagnosis']

# Calculate the absolute value of the correlations to measure the strength
abs_corr_with_diagnosis = corr_with_diagnosis.abs()

# Sort the absolute correlations in descending order (from highest to lowest)
most_correlated_features_diagnosis = abs_corr_with_diagnosis.sort_values(ascending=False)

print("Features sorted by their correlation with 'diagnosis':")
display(most_correlated_features_diagnosis)

The last 5 features have an absolute correlation below 0.10 with the target, so they likely carry little linear signal and mostly add noise to the model. They are also candidates for feature reduction.

Features to be dropped: I'll drop the features with low correlation with the target variable, but keep the ones highly correlated with each other, like 'radius_mean' and 'perimeter_mean'. These will be handled by PCA.

So, the features to be dropped are:
- fractal_dimension_se
- smoothness_se
- fractal_dimension_mean
- texture_se
- symmetry_se
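
Instead of copying the names by hand, the list can be derived from the correlation values. The sketch below uses a hypothetical helper, `low_correlation_features`, and a made-up frame in which one column is constructed to be uncorrelated with the target; the 0.10 threshold is the one quoted above.

```python
import pandas as pd


def low_correlation_features(frame, target, threshold=0.10):
    """Return feature names whose absolute Pearson correlation with
    `target` is below `threshold`."""
    abs_corr = frame.corr()[target].drop(target).abs()
    return abs_corr[abs_corr < threshold].index.tolist()


# Made-up data: `strong` tracks the target, `weak` is uncorrelated with it
toy = pd.DataFrame({
    "diagnosis": [0, 1, 0, 1, 0, 1, 0, 1],
    "strong": [1.0, 3.0, 1.2, 3.1, 0.9, 2.9, 1.1, 3.2],
    "weak": [1, 1, -1, -1, 1, 1, -1, -1],
})
weak_features = low_correlation_features(toy, "diagnosis")
print(weak_features)
```

Assuming `df_corr` is the cleaned frame built in the correlation cell, `low_correlation_features(df_corr, 'diagnosis')` should reproduce the five names listed above.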

## 5. Feature distributions by diagnosis

In [None]:
df_corr = df.copy()
df_corr['diagnosis'] = df_corr['diagnosis'].map({'M': 1, 'B': 0})

# Drop non-numeric or irrelevant columns
if 'id' in df_corr.columns:
    df_corr = df_corr.drop('id', axis=1)
if 'Unnamed: 32' in df_corr.columns:
    df_corr = df_corr.drop('Unnamed: 32', axis=1)

# Calculate most correlated features with 'diagnosis'
corr_with_target = df_corr.corr()['diagnosis'].abs().sort_values(ascending=False)

# Select top 5 features
top_5_features = corr_with_target.index[1:6] # First index is 'diagnosis'
print("Top 5 most correlated features with diagnosis:")
print(top_5_features)

From the summary statistics, we know that these features have very different scales, so we'll split them into small-scale and large-scale groups for clearer plots.

In [None]:
features_large_scale = ['perimeter_worst', 'radius_worst', 'perimeter_mean']
features_small_scale = ['concave points_worst', 'concave points_mean']

print("Plotting features with large scales:", features_large_scale)
print("Plotting features with small scales:", features_small_scale)


df_melted_large = pd.melt(df, 
                          id_vars="diagnosis", 
                          value_vars=features_large_scale,
                          var_name="feature", 
                          value_name="value")

df_melted_small = pd.melt(df, 
                          id_vars="diagnosis", 
                          value_vars=features_small_scale,
                          var_name="feature", 
                          value_name="value")

# We create a figure with 2 rows and 1 column of plots
fig, axes = plt.subplots(2, 1, figsize=(15, 12))
fig.suptitle('Distribution of Top Features by Diagnosis (Separated by Scale)', fontsize=18)

# Plot 1: Large Scale Features
sns.boxplot(ax=axes[0], x="feature", y="value", hue="diagnosis", data=df_melted_large, palette="viridis")
axes[0].set_title('Features with Larger Value Ranges', fontsize=14)
axes[0].set_xlabel('') # Remove x-axis label for the top plot for cleanliness
axes[0].set_ylabel('Value', fontsize=12)
axes[0].grid(True)

# Plot 2: Small Scale Features
sns.boxplot(ax=axes[1], x="feature", y="value", hue="diagnosis", data=df_melted_small, palette="plasma")
axes[1].set_title('Features with Smaller Value Ranges', fontsize=14)
axes[1].set_xlabel('Feature', fontsize=12)
axes[1].set_ylabel('Value', fontsize=12)
axes[1].grid(True)

plt.tight_layout(rect=[0, 0.03, 1, 0.96]) # Adjust layout to make room for the suptitle
plt.savefig(f'{FIGURES_PATH}feature_distribution_with_target_boxplots.png', dpi=300)
plt.show()

Let's also create a pairplot of the top 5 features:

In [None]:
top_5_features_with_diagnosis = top_5_features.tolist() + ['diagnosis']
df_pairplot = df[top_5_features_with_diagnosis]

print("Generating pair plot for top 5 features. This may take a moment...")

# The 'hue' parameter colors the data points by the 'diagnosis' column
sns.pairplot(df_pairplot, hue='diagnosis', palette='viridis', diag_kind='kde')
plt.savefig(f'{FIGURES_PATH}top_features_pairplot.png', dpi=300)
plt.show()

### Key observations from feature distribution plots

1.  **All these features are highly predictive:** Every feature visualized shows a clear and significant difference in its value distribution between Malignant (M) and Benign (B) tumors, confirming their high predictive power.

2.  **Malignant tumors are consistently larger:** Size-related features like `perimeter_worst` and `radius_worst` are consistently higher for malignant cases, with minimal overlap between the two diagnostic groups.

3.  **Concave points are a critical differentiator:** The `concave points` features show a dramatic separation between classes, with malignant tumors having substantially higher values. This makes them one of the most important indicators.

4.  **High confidence in model separability:** The distinct separation of distributions suggests that even a simple machine learning model will be able to effectively learn a boundary to distinguish between malignant and benign cases with high accuracy.

5.  **Visual confirmation of redundancy:** The fact that all size-related features tell the same story (malignant bigger than benign) visually confirms they are measuring a similar underlying concept. This further suggests the use of PCA to handle multicollinearity in the dataset.
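
The visual separation described in these observations can also be summarised with a number. One common option, which is not part of the original analysis, is the standardized mean difference (Cohen's d) between the two diagnosis groups. The sketch below uses a hypothetical helper and made-up measurements, so the printed value says nothing about the real features.

```python
import numpy as np


def cohens_d(group_a, group_b):
    """Standardized mean difference between two samples (pooled std)."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


# Made-up measurements standing in for one feature split by diagnosis
malignant = [4.0, 5.0, 6.0]
benign = [1.0, 2.0, 3.0]
effect = cohens_d(malignant, benign)
print(f"Cohen's d = {effect:.2f}")
```

On the notebook's data, the two inputs would be a feature column filtered by `diagnosis == 'M'` and `diagnosis == 'B'`; larger absolute values indicate less overlap between the two distributions.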

## 6. Principal component analysis

In [None]:
# Select all numeric features
features = df_corr.select_dtypes(include='number').drop('diagnosis', axis=1, errors='ignore')
diagnosis = df_corr['diagnosis']

features_to_drop = [
    'fractal_dimension_se',
    'smoothness_se',
    'fractal_dimension_mean',
    'texture_se',
    'symmetry_se'
]
print(f"Dropping {len(features_to_drop)} noisy features before PCA.")
features_cleaned = features.drop(columns=features_to_drop)

# Scale the cleaned features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_cleaned)

# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
principal_components = pca.fit_transform(features_scaled)

# Create a new DataFrame with the principal components
df_pca = pd.DataFrame(data=principal_components, columns=['PC_1', 'PC_2'])
df_pca['diagnosis'] = diagnosis

print(f"Variance explained by PC1: {pca.explained_variance_ratio_[0]:.2%}")
print(f"Variance explained by PC2: {pca.explained_variance_ratio_[1]:.2%}")
print(f"Total variance explained by first two components: {sum(pca.explained_variance_ratio_):.2%}")

# Plot the results
plt.figure(figsize=(12, 8))
sns.scatterplot(x='PC_1', y='PC_2', hue='diagnosis', data=df_pca, alpha=0.8, palette='magma')

plt.title('2D PCA Projection of Breast Cancer Data (Noise Removed)', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.grid(True)
plt.legend()
plt.savefig(f'{FIGURES_PATH}pca_2d_projection.png', dpi=300)
plt.show()

Let's add a third principal component to capture more of the variance.
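
As a side note on what "captured variance" means here, the explained-variance ratios reported by PCA can be reproduced with plain NumPy from the singular values of the standardized data. This is a sketch on made-up numbers (two nearly collinear columns plus one unrelated column), not a replacement for the scikit-learn call used in this notebook.

```python
import numpy as np

# Made-up, deterministic data: 6 samples, 3 features, two of them nearly collinear
X = np.array([
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.1],
    [3.0, 6.2, 0.9],
    [4.0, 7.8, 0.3],
    [5.0, 10.1, 0.7],
    [6.0, 12.0, 0.2],
])

# Standardize each column (zero mean, unit variance), as StandardScaler does
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Squared singular values are proportional to the variance along each component
_, singular_values, _ = np.linalg.svd(Z, full_matrices=False)
component_variance = singular_values ** 2
explained_variance_ratio = component_variance / component_variance.sum()

print(explained_variance_ratio.round(4))
```

Because the first two columns move together, most of the variance ends up in the first component, which is the same effect the correlated size features produce in the real data.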

In [None]:
features = df_corr.select_dtypes(include='number').drop('diagnosis', axis=1, errors='ignore')
diagnosis = df_corr['diagnosis']

features_to_drop = [
    'fractal_dimension_se',
    'smoothness_se',
    'fractal_dimension_mean',
    'texture_se',
    'symmetry_se'
]
print(f"Dropping {len(features_to_drop)} noisy features before PCA.")
features_cleaned = features.drop(columns=features_to_drop)

# Scale the cleaned features
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_cleaned)

# Apply PCA to reduce to 3 dimensions
pca = PCA(n_components=3)
principal_components = pca.fit_transform(features_scaled)

# Create a new DataFrame with the three principal components
df_pca = pd.DataFrame(data=principal_components, columns=['PC_1', 'PC_2', 'PC_3'])
df_pca['diagnosis'] = diagnosis

print(f"Variance explained by PC1: {pca.explained_variance_ratio_[0]:.2%}")
print(f"Variance explained by PC2: {pca.explained_variance_ratio_[1]:.2%}")
print(f"Variance explained by PC3: {pca.explained_variance_ratio_[2]:.2%}")
print(f"Total variance explained by first three components: {sum(pca.explained_variance_ratio_):.2%}")

# Plot the results in 3D
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

# Separate data for plotting
df_m = df_pca[df_pca['diagnosis'] == 1]
df_b = df_pca[df_pca['diagnosis'] == 0]

# Plot each class separately
ax.scatter(df_m['PC_1'], df_m['PC_2'], df_m['PC_3'], c='purple', label='Malignant', alpha=0.6)
ax.scatter(df_b['PC_1'], df_b['PC_2'], df_b['PC_3'], c='orange', label='Benign', alpha=0.6)

# Set labels and title
ax.set_title('3D PCA Projection of Breast Cancer Data (Noise Removed)', fontsize=16)
ax.set_xlabel('Principal Component 1', fontsize=12)
ax.set_ylabel('Principal Component 2', fontsize=12)
ax.set_zlabel('Principal Component 3', fontsize=12)
ax.legend()
ax.grid(True)

plt.savefig(f'{FIGURES_PATH}pca_3d_projection.png', dpi=300)
plt.show()

Let's determine how many principal components are needed to capture ~90% of the variance.

In [None]:
pca_full = PCA().fit(features_scaled)

plt.figure(figsize=(10, 7))
plt.plot(range(1, len(pca_full.explained_variance_ratio_) + 1), 
         np.cumsum(pca_full.explained_variance_ratio_), 
         marker='o', 
         linestyle='--')

plt.title('Cumulative Explained Variance by Number of Components', fontsize=16)
plt.xlabel('Number of Components', fontsize=12)
plt.ylabel('Cumulative Explained Variance', fontsize=12)
plt.grid(True)
plt.axhline(y=0.90, color='r', linestyle=':', label='90% Explained Variance')
plt.legend()
plt.savefig(f'{FIGURES_PATH}pca_cum_variance_by_n_of_components.png', dpi=300)
plt.show()

### Key observations from PCA

1.  **High redundancy confirmed:** The steep initial curve of the explained variance plot confirms that the original features are highly correlated.

2.  **Optimal component range identified:** The "elbow" of the curve occurs between 2 and 3 components, thus indicating that the majority of the crucial information is captured within these first few components.

3.  **Data-driven feature reduction:** The analysis provides a clear strategy for selecting the number of components. Approximately **7 components** are required to capture over 90% of the total variance, offering a robust trade-off between information retention and model simplicity.

4.  **Excellent class separability:** The 2D and 3D projections show that the data points form two distinct, well-separated clusters corresponding to the Malignant and Benign diagnoses. This gives high confidence that a machine learning model can perform well on this classification task.
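
The number of components needed for a given variance target can be computed rather than read off the plot. The sketch below uses a hypothetical helper, `components_for_variance`, and made-up ratios; in this notebook the input would be `pca_full.explained_variance_ratio_`.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, target=0.90):
    """Smallest number of leading components whose cumulative
    explained variance reaches `target`.

    Assumes `target` is actually reached by the cumulative sum.
    """
    cumulative = np.cumsum(explained_variance_ratio)
    # argmax returns the first index where the condition is True
    return int(np.argmax(cumulative >= target) + 1)


# Made-up ratios for illustration; in the notebook these would come from
# pca_full.explained_variance_ratio_
toy_ratios = np.array([0.50, 0.20, 0.15, 0.10, 0.05])
n_needed = components_for_variance(toy_ratios)
print(n_needed)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.90)`), in which case it keeps just enough components to exceed that fraction of the variance.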

## 7. Next steps

This Exploratory Data Analysis has provided several key insights:
- The dataset is clean, with the only missing values being in a completely empty column (`Unnamed: 32`) that can be dropped.
- The target variable is slightly imbalanced, but not severely.
- Many features are highly correlated, confirming that dimensionality reduction techniques like PCA are appropriate.
- A clear separation between classes is visible, suggesting that machine learning models will perform well.
- We have identified a set of 5 features with very low correlation to the target, which can be dropped during preprocessing to reduce noise.

The next logical step is to formally preprocess this data to prepare it for modeling.

**Next Notebook:** [**2.0-data_preprocessing.ipynb**](./2.0-data_preprocessing.ipynb)