# 2.0 - Data Preprocessing

This notebook focuses on the **data preprocessing** stage. The goal is to take the raw data and transform it into a clean, numerical, and scaled format that is ready for machine learning.

The scope of this notebook is strictly limited to:
1.  Applying initial cleaning steps (using functions from our `.py` scripts).
2.  Performing initial feature selection to remove noise.
3.  Scaling the features to a standard range.

This notebook prepares the data for the next stage, `3.0-feature_engineering.ipynb`, where more advanced techniques like PCA will be explored.

In [None]:
import pandas as pd
import sys
import os

# Add the 'src' folder to the module search path so its packages can be imported.
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

# Now we can import our custom functions
from model.data_ingestion import load_raw_data
from model.data_preprocessing import (
    drop_unnecessary_columns,
    map_diagnosis_to_numerical,
    prepare_features_and_target,
)

# Display options for pandas
pd.set_option('display.max_columns', None)

## 1. Initial cleaning

First, we'll load the raw data and apply the pre-existing, tested functions from `src/model/data_preprocessing.py`. This ensures we are reusing our production code and maintaining consistency.
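For readers without access to the `src` folder, the following is a minimal sketch of what cleaning helpers of this kind might look like. It is not the project's actual implementation: the `_sketch` function names, the dropped column names (`id`, `Unnamed: 32`), and the label mapping (`M` to 1, `B` to 0) are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-ins; the real functions live in src/model/data_preprocessing.py
# and may behave differently. Column names and the label mapping are assumptions.

def drop_unnecessary_columns_sketch(df, columns=("id", "Unnamed: 32")):
    """Drop identifier or empty columns if present; silently skip missing ones."""
    return df.drop(columns=list(columns), errors="ignore")

def map_diagnosis_to_numerical_sketch(df, mapping=None):
    """Encode the 'diagnosis' label as integers (assumed: M -> 1, B -> 0)."""
    mapping = mapping or {"M": 1, "B": 0}
    out = df.copy()
    out["diagnosis"] = out["diagnosis"].map(mapping)
    return out

def prepare_features_and_target_sketch(df, target="diagnosis"):
    """Split a frame into the feature matrix X and the target series y."""
    return df.drop(columns=[target]), df[target]

# Tiny illustrative frame, not real data.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "M"],
    "radius_mean": [17.99, 13.54, 20.57],
})
X_toy, y_toy = prepare_features_and_target_sketch(
    map_diagnosis_to_numerical_sketch(drop_unnecessary_columns_sketch(toy))
)
print(X_toy.columns.tolist(), y_toy.tolist())
```

Each helper returns a new object rather than mutating its input, which keeps the raw frame available for inspection.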

In [None]:
# Load the raw data
df_raw = load_raw_data('../data/data.csv')

df_raw.head()

In [None]:
# Apply initial cleaning steps from our script
df_cleaned = drop_unnecessary_columns(df_raw.copy())
df_mapped = map_diagnosis_to_numerical(df_cleaned.copy())

# Separate features (X) and target (y)
X, y = prepare_features_and_target(df_mapped)

print("Shape of features (X) before further preprocessing:", X.shape)
X.head()

## 2. Initial feature selection (noise reduction)

The EDA in `1.0-EDA.ipynb` identified several features whose correlation with the target variable (`diagnosis`) is very low (absolute value below 0.10). These features are more likely to be noise than signal, at least as far as linear relationships are concerned. Removing them is a common preprocessing step that simplifies the model and may improve its performance.

**This is distinct** from *feature engineering* or *dimensionality reduction* (like PCA), where the goal is to create new, more informative features. Here, we will simply remove what appears to be irrelevant information.
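As a rough illustration of how such a list can be derived, the sketch below flags features whose absolute Pearson correlation with the target falls below a threshold. The helper name `low_correlation_features` and the toy data are hypothetical; the actual list used in this notebook comes from the EDA notebook, not from this code.

```python
import pandas as pd

def low_correlation_features(X, y, threshold=0.10):
    """Return names of columns whose |Pearson correlation| with y is below threshold."""
    corr = X.corrwith(y).abs()
    return corr[corr < threshold].index.tolist()

# Deterministic toy data: 'signal' tracks the target, 'noise' is uncorrelated with it.
y_toy = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], name="diagnosis")
X_toy = pd.DataFrame({
    "signal": [1.0, 3.0, 1.2, 2.8, 0.9, 3.1, 1.1, 2.9],
    "noise": [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0],
})

to_drop = low_correlation_features(X_toy, y_toy)
print(to_drop)
```

Taking the absolute value matters: a strongly negative correlation is informative and should not be treated as noise.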

In [None]:
# List of features to drop, identified from EDA (absolute correlation < 0.10 with target)
features_to_drop = [
    'fractal_dimension_se',
    'smoothness_se',
    'fractal_dimension_mean',
    'texture_se',
    'symmetry_se'
]

# Drop the features
X_selected = X.drop(columns=features_to_drop)

print(f"Dropped {len(features_to_drop)} features.")
print("Shape of features (X) after selection:", X_selected.shape)
X_selected.head()

## 3. Feature scaling

The features in our dataset have vastly different scales. Algorithms that rely on distances or variances (like SVMs or PCA) let large-valued features dominate, and gradient-based optimizers (as used for logistic regression) tend to converge more slowly on unscaled inputs.
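To make the scale problem concrete, here is a small sketch with made-up numbers: one column mimics a large-valued feature (such as an area) and the other a small-valued one (such as a smoothness score). It measures how much each feature contributes to the squared Euclidean distance between two rows, before and after standardization. The helper `squared_distance_share` is introduced here purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: column 0 is large-scale, column 1 is small-scale.
X_toy = np.array([
    [1000.0, 0.10],
    [1010.0, 0.20],
    [1500.0, 0.10],
    [500.0, 0.15],
])

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Rows 0 and 1 differ slightly in column 0 and substantially (relative to
# that column's spread) in column 1.
raw_share = squared_distance_share(X_toy[0], X_toy[1])

X_std = StandardScaler().fit_transform(X_toy)
scaled_share = squared_distance_share(X_std[0], X_std[1])

print("raw share per feature:   ", raw_share.round(4))
print("scaled share per feature:", scaled_share.round(4))
```

On the raw values the distance is driven almost entirely by the large-scale column; after standardization the second column's difference, which is large relative to its own spread, becomes visible.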

We will use `StandardScaler` from scikit-learn to transform the data. It standardizes features by removing the mean and scaling to unit variance. The formula for standardization is:

$$ z = \frac{x - \mu}{\sigma} $$

Where:
- $z$ is the scaled value.
- $x$ is the original value.
- $\mu$ (mu) is the mean of the feature column.
- $\sigma$ (sigma) is the standard deviation of the feature column. `StandardScaler` uses the population standard deviation (`ddof=0`).
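The formula can be checked by hand on a tiny example. The sketch below applies it with NumPy to a single made-up column and compares the result with `StandardScaler`; the input values are chosen only so that the mean and standard deviation come out as round numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column with illustrative values (mean 5, population std 2).
x = np.array([[2.0], [4.0], [4.0], [4.0], [5.0], [5.0], [7.0], [9.0]])

mu = x.mean(axis=0)
sigma = x.std(axis=0)          # NumPy defaults to ddof=0, like StandardScaler
z_manual = (x - mu) / sigma

z_sklearn = StandardScaler().fit_transform(x)

print("mu:", mu, "sigma:", sigma)
print("manual and sklearn agree:", np.allclose(z_manual, z_sklearn))
```

For this column, the first value 2.0 maps to (2 - 5) / 2 = -1.5 and the last value 9.0 maps to (9 - 5) / 2 = 2.0.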

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler to the data and transform it
X_scaled_array = scaler.fit_transform(X_selected)

# Convert the scaled array back to a DataFrame for better readability.
# Keep the original index so the rows stay aligned with the target (y).
X_processed = pd.DataFrame(
    X_scaled_array, columns=X_selected.columns, index=X_selected.index
)

print("Data successfully scaled.")
X_processed.head()

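Let's check the summary statistics of the processed data to confirm that it has been standardized. Each column should show a mean close to 0 and a standard deviation close to 1. Note that `describe()` reports the sample standard deviation (`ddof=1`), whereas `StandardScaler` divides by the population standard deviation (`ddof=0`), so the reported `std` will be slightly above 1.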

In [None]:
X_processed.describe().T
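Beyond eyeballing the table, the same check can be expressed programmatically. The sketch below uses a hypothetical helper, `check_standardized`, on a small made-up frame; it tests the population standard deviation (`ddof=0`), which is what `StandardScaler` normalizes to, and also shows the value that pandas reports by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def check_standardized(df, atol=1e-8):
    """Return True if every column has mean ~0 and population std ~1."""
    means_ok = np.allclose(df.mean(), 0.0, atol=atol)
    stds_ok = np.allclose(df.std(ddof=0), 1.0, atol=atol)
    return bool(means_ok and stds_ok)

# Illustrative frame, not the notebook's data.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 0.0, 5.0, 25.0]})
toy_scaled = pd.DataFrame(
    StandardScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

print("standardized:", check_standardized(toy_scaled))
# pandas uses ddof=1 by default, so this is sqrt(n / (n - 1)) rather than 1.
print(toy_scaled.std())
```

With only four rows the gap between the two conventions is large; on a dataset with several hundred rows it shrinks to a fraction of a percent.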

## 4. Next steps

The data preprocessing step is complete. We have created a fully preprocessed feature set (`X_processed`) and have our target variable (`y`).

**Boundary:**
- The output of this notebook is the clean and scaled data.
- This data is the direct input for the `3.0-feature_engineering.ipynb` notebook, where we'll explore techniques like Principal Component Analysis (PCA) to create new, engineered features from this processed dataset. We will also investigate other feature creation methods.
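One caveat for later stages: if the data is eventually split into training and test sets for model evaluation, fitting the scaler on all rows lets test-set statistics leak into training. A common way to avoid this is to fit on the training rows only and reuse those statistics for the test rows. The sketch below shows that pattern on synthetic data; the split ratio, the random seed, and the toy arrays are assumptions and are not part of this project's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 features, alternating binary labels.
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = np.arange(100) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# Reuse the training statistics; do not refit on the test rows.
X_test_scaled = scaler.transform(X_test)

print("train shape:", X_train_scaled.shape, "test shape:", X_test_scaled.shape)
```

With this pattern the training columns have a mean of exactly 0 after scaling, while the test columns are only approximately centered, which is expected.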