# 3.0 - Feature engineering

This notebook focuses on **feature engineering**. The goal is to use the clean, preprocessed data from the previous stage to create new, potentially more informative features using different techniques.

We will explore three main approaches:
1.  **Principal Component Analysis (PCA):** A dimensionality reduction technique to address multicollinearity.
2.  **Ratio features:** Creating new features based on domain knowledge to capture geometric properties like shape and growth.
3.  **Polynomial features:** Automatically generating interaction and higher-order features to capture non-linear relationships.


## 1. Load and preprocess data

As a best practice for reproducibility, we will start by running the complete preprocessing pipeline developed in the previous notebooks. This provides a self-contained environment and gives us the clean, scaled dataset that is the starting point for all our feature engineering experiments.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add the src directory to the Python path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

# Import custom functions
from model.data_ingestion import load_raw_data
from model.data_preprocessing import drop_unnecessary_columns, map_diagnosis_to_numerical, prepare_features_and_target

# Import necessary tools from scikit-learn
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.decomposition import PCA

# Set plotting styles
sns.set_style('whitegrid')
%matplotlib inline

# Display options for pandas
pd.set_option('display.max_columns', None)


The cell below runs the preprocessing steps that were developed and tested in `2.0-data_preprocessing.ipynb`. It keeps the unscaled features (`X_selected_unscaled`) alongside the scaled baseline (`X_processed`), because the ratio features created later need the original units.

In [None]:
# Load Data
df_raw = load_raw_data('../data/data.csv')

# Initial Cleaning & Mapping
df_cleaned = drop_unnecessary_columns(df_raw.copy())
df_mapped = map_diagnosis_to_numerical(df_cleaned.copy())

# We need the original, unscaled features to create meaningful ratios
X_unscaled, y = prepare_features_and_target(df_mapped)

# Initial Feature Selection
features_to_drop = [
    'fractal_dimension_se',
    'smoothness_se',
    'fractal_dimension_mean',
    'texture_se',
    'symmetry_se'
]

X_selected_unscaled = X_unscaled.drop(columns=features_to_drop)

# Feature Scaling
scaler = StandardScaler()
X_scaled_array = scaler.fit_transform(X_selected_unscaled)
X_processed = pd.DataFrame(X_scaled_array, columns=X_selected_unscaled.columns)

print("Data successfully preprocessed.")
print("Shape of baseline processed features (X_processed):", X_processed.shape)
X_processed.head()


## 2. Technique 1: Principal Component Analysis (PCA)

First, we'll apply PCA to our baseline `X_processed` dataset to create a lower-dimensional feature set. PCA is a dimensionality reduction technique that transforms a large set of correlated variables into a smaller set of uncorrelated variables called **principal components**.

These principal components are:
- linear combinations of the original features,
- ordered by the amount of variance they explain in the data.

This suits our dataset, which showed high multicollinearity, especially among the features related to tumor size (radius, perimeter, and area). With PCA we can reduce that redundancy and obtain a more compact feature set for our models.

A disadvantage of PCA is that each principal component mixes all of the original features, so the components are hard to interpret and a model built on them is harder to explain to stakeholders. We want a model that performs well and generalizes to new data, but interpretability has to be weighed alongside performance.
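
The two properties listed above (linear combinations, ordered by explained variance) can be checked on a small example. The sketch below uses synthetic, strongly correlated "size" columns rather than the notebook's data; the names `X_demo` and `pca_demo` and the noise levels are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for size-related measurements (not the notebook's data):
# perimeter and area are derived from radius, so the columns are highly correlated.
rng = np.random.default_rng(0)
radius = rng.normal(loc=14.0, scale=3.0, size=500)
perimeter = 2 * np.pi * radius + rng.normal(scale=1.0, size=500)
area = np.pi * radius**2 + rng.normal(scale=20.0, size=500)
X_demo = StandardScaler().fit_transform(np.column_stack([radius, perimeter, area]))

pca_demo = PCA().fit(X_demo)
scores = pca_demo.transform(X_demo)

# Explained variance ratios are returned sorted from largest to smallest
evr = pca_demo.explained_variance_ratio_
print("Explained variance ratio:", np.round(evr, 3))

# Each row of components_ holds the weights of one linear combination of the inputs
print("Weights of the first component:", np.round(pca_demo.components_[0], 3))

# The transformed columns are (numerically) uncorrelated with each other
corr = np.corrcoef(scores, rowvar=False)
max_off_diagonal = np.abs(corr - np.eye(3)).max()
print("Largest off-diagonal correlation:", max_off_diagonal)
```

With inputs this redundant, almost all of the variance should land in the first component, which is the behaviour we rely on when compressing the size-related features.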


In [None]:
# Determine optimal number of components
pca_full = PCA().fit(X_processed)
plt.figure(figsize=(10, 7))
plt.plot(range(1, len(pca_full.explained_variance_ratio_) + 1), 
         np.cumsum(pca_full.explained_variance_ratio_), 
         marker='o', linestyle='--')
plt.title('Cumulative Explained Variance by Number of Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.axhline(y=0.90, color='r', linestyle=':', label='90% Explained Variance')
plt.axhline(y=0.95, color='g', linestyle=':', label='95% Explained Variance')
plt.legend()
plt.show()


In [None]:
# Apply PCA with 8 components to capture >95% of variance
pca = PCA(n_components=8)
X_pca_array = pca.fit_transform(X_processed)
pca_columns = [f'PC_{i+1}' for i in range(X_pca_array.shape[1])]
X_pca = pd.DataFrame(X_pca_array, columns=pca_columns)

print("Shape of PCA features (X_pca):", X_pca.shape)
X_pca.head()
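
Instead of reading the number of components off the plot, the threshold can also be applied programmatically. The following is a minimal sketch on synthetic data (the array `X_demo` and the 0.95 target are assumptions for illustration); it shows two equivalent ways of picking the count: from the cumulative curve, and by passing a fraction to `PCA(n_components=...)`, which scikit-learn supports with the full SVD solver.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data standing in for X_processed (an assumption for this sketch)
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 4))
mixing = rng.normal(size=(4, 12))
X_demo = latent @ mixing + 0.3 * rng.normal(size=(300, 12))

target = 0.95  # fraction of variance to retain

# Option 1: first position where the cumulative explained variance exceeds the target
cumulative = np.cumsum(PCA(svd_solver="full").fit(X_demo).explained_variance_ratio_)
n_from_curve = int(np.argmax(cumulative > target)) + 1

# Option 2: pass the fraction directly and let scikit-learn choose the count
pca_auto = PCA(n_components=target, svd_solver="full").fit(X_demo)

print("Components from the cumulative curve:", n_from_curve)
print("Components chosen by PCA(n_components=0.95):", pca_auto.n_components_)
```

Either approach avoids hard-coding the component count, which helps if the upstream feature selection changes.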


## 3. Technique 2: Ratio Features (Shape Metrics)

Now, let's create features based on domain knowledge. We'll calculate two ratios that describe tumor shape and growth, properties we expect to be informative for the diagnosis. The ratios must be calculated on the **unscaled** data to preserve their physical meaning; afterwards, we scale the complete feature set.
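
To see what a perimeter-squared-over-area ratio measures, and why it needs unscaled inputs, here is a small worked example on idealized shapes. The helper names `shape_factor` and `zscore` and the toy numbers are hypothetical and only serve the illustration; real tumor outlines are of course not perfect circles or rectangles.

```python
import numpy as np

def shape_factor(perimeter, area):
    """Perimeter squared over area: dimensionless, so it does not depend on size."""
    return perimeter**2 / area

def zscore(values):
    """Standardize to zero mean and unit variance, as StandardScaler would."""
    return (values - values.mean()) / values.std()

# Circles of different sizes give the same value, 4*pi (about 12.57)
for r in (1.0, 10.0):
    print(f"circle r={r}:", shape_factor(2 * np.pi * r, np.pi * r**2))

# Less compact shapes give larger values: a square gives 16, a 1x10 rectangle 48.4
print("square (side 2):", shape_factor(4 * 2.0, 2.0**2))
print("1x10 rectangle: ", shape_factor(2 * (1 + 10), 1 * 10))

# Why unscaled inputs: standardized columns are centered on zero, so the ratio
# divides by values that can be negative or close to zero and loses its meaning
perimeter = np.array([60.0, 80.0, 100.0])
area = np.array([280.0, 500.0, 780.0])
raw_ratio = shape_factor(perimeter, area)
scaled_ratio = shape_factor(zscore(perimeter), zscore(area))
print("ratio on raw values:", raw_ratio)
print("ratio on z-scores:  ", scaled_ratio)
```

The ratio on raw values stays positive and comparable across samples, while the ratio on z-scores produces negative values that have no geometric interpretation.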


In [None]:
# Use the unscaled data with selected features
X_with_ratios_unscaled = X_selected_unscaled.copy()

# Shape factor: perimeter^2 / area, a dimensionless compactness measure
# Inputs are read from X_unscaled (the frame before feature selection), in original units
X_with_ratios_unscaled['shape_factor'] = X_unscaled['perimeter_mean']**2 / X_unscaled['area_mean']

# Growth factor: worst-case radius relative to the mean radius
X_with_ratios_unscaled['radius_growth'] = X_unscaled['radius_worst'] / X_unscaled['radius_mean']

# Scale the entire new feature set
scaler_ratios = StandardScaler()
X_ratios_scaled_array = scaler_ratios.fit_transform(X_with_ratios_unscaled)
X_with_ratios = pd.DataFrame(X_ratios_scaled_array, columns=X_with_ratios_unscaled.columns)

print("Shape of features with ratios (X_with_ratios):", X_with_ratios.shape)
X_with_ratios.head()


## 4. Technique 3: Polynomial Features

This technique automatically creates interaction terms and higher-order features. To avoid creating too many features (the "curse of dimensionality"), we will only apply this to the top 5 most predictive features identified in our EDA.


In [None]:
# Top 5 features from EDA
top_features = [
    'concave points_worst', 
    'perimeter_worst', 
    'concave points_mean',
    'radius_worst', 
    'perimeter_mean'
]

# Initialize PolynomialFeatures
# degree=2 creates interaction terms (a*b) and quadratic terms (a^2)
# include_bias=False prevents adding a constant column of ones
poly = PolynomialFeatures(degree=2, include_bias=False)

# Fit and transform the top features from the SCALED data
poly_features = poly.fit_transform(X_processed[top_features])

# Create a DataFrame with the new polynomial feature names
poly_feature_names = poly.get_feature_names_out(top_features)
X_poly_generated = pd.DataFrame(poly_features, columns=poly_feature_names)

# Drop the original top 5 features from our main processed set to avoid duplication
X_processed_without_top5 = X_processed.drop(columns=top_features)

# Concatenate the two dataframes
X_poly = pd.concat(
    [
        X_processed_without_top5.reset_index(drop=True),
        X_poly_generated.reset_index(drop=True),
    ],
    axis=1,
)

print("Shape of polynomial features (X_poly):", X_poly.shape)
X_poly.head()


As the output shows, replacing the top 5 features with their 20 degree-2 terms raises the column count from 25 to 40, which is still manageable. The extra terms can help capture non-linear relationships in the data, but they also increase the risk of overfitting, so we need to watch for that during modeling.

## 5. Next steps

We have engineered several new feature sets, giving us multiple candidates for our modeling stage.

**Generated Feature Sets:**
1.  **`X_processed`**: Baseline cleaned and scaled features (25 features).
2.  **`X_pca`**: PCA-transformed features (8 features).
3.  **`X_with_ratios`**: Baseline features plus our new shape and growth metrics, all scaled (27 features).
4.  **`X_poly`**: Baseline features with the top 5 replaced by their polynomial/interaction terms (40 features).

**Next Notebook:** `4.0-model_experimentation.ipynb`. In this next step, we will train and evaluate various classification models on all four of these feature sets to find the combination that yields the best performance.
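
As a rough idea of how such a comparison could be organized, the sketch below scores two candidate feature pipelines with cross-validation. Everything in it is an assumption made for illustration: the data comes from `make_classification` rather than our dataset, and logistic regression with 5-fold accuracy is a placeholder, not the setup chosen for the next notebook. Wrapping the scaler and PCA in a pipeline means they are refit on each training fold, so no information from the held-out fold leaks into the transforms.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the unscaled features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=25, n_informative=8, random_state=0
)

# One pipeline per candidate feature set (placeholder model and settings)
candidates = {
    "baseline": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "pca": make_pipeline(
        StandardScaler(), PCA(n_components=8), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, pipeline in candidates.items():
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```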
