# 2.0 - Data Preprocessing

This notebook focuses on the **data preprocessing** stage. The goal is to take the raw data and transform it into a clean, numerical, and scaled format that is ready for machine learning.

The scope of this notebook is strictly limited to:
1.  Applying initial cleaning steps (using functions from our `.py` scripts).
2.  Performing initial feature selection to remove noise.
3.  Scaling the features to a standard range.

This notebook prepares the data for the next stage, `3.0-Feature_engineering.ipynb`, where more advanced techniques like PCA will be explored.

In [1]:
import pandas as pd
import sys
import os

# Add the 'src' folder to the import path so its modules can be imported.
sys.path.append(os.path.abspath(os.path.join('..', 'src')))

# Now we can import our custom functions
from model.data_ingestion import load_raw_data
from model.data_preprocessing import drop_unnecessary_columns, map_diagnosis_to_numerical, prepare_features_and_target

# Display options for pandas
pd.set_option('display.max_columns', None)

## 1. Initial cleaning

First, we'll load the raw data and apply the existing, tested functions from `src/model/data_preprocessing.py`. Reusing the production code keeps this notebook consistent with the rest of the pipeline.

In [3]:
# Load the raw data
df_raw = load_raw_data('../data/data.csv')

df_raw.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [4]:
# Apply initial cleaning steps from our script
df_cleaned = drop_unnecessary_columns(df_raw.copy())
df_mapped = map_diagnosis_to_numerical(df_cleaned.copy())

# Separate features (X) and target (y)
X, y = prepare_features_and_target(df_mapped)

print("Shape of features (X) before further preprocessing:", X.shape)
X.head()

Shape of features (X) before further preprocessing: (569, 30)


Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
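
The helper implementations live in `src/model/data_preprocessing.py` and are not shown in this notebook. As a rough sketch only, the three steps might look like the following. The function names with the `_sketch` suffix are hypothetical, the dropped columns (`id`, `Unnamed: 32`) are inferred from the shapes above, and the `M` to 1 / `B` to 0 encoding is an assumption.

```python
import pandas as pd


def drop_unnecessary_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the identifier column and the empty trailing column, if present."""
    return df.drop(columns=["id", "Unnamed: 32"], errors="ignore")


def map_diagnosis_to_numerical_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Encode diagnosis as integers (assumed encoding: M -> 1, B -> 0)."""
    out = df.copy()
    out["diagnosis"] = out["diagnosis"].map({"M": 1, "B": 0})
    return out


def prepare_features_and_target_sketch(df: pd.DataFrame):
    """Split the frame into the feature matrix X and the target vector y."""
    X = df.drop(columns=["diagnosis"])
    y = df["diagnosis"]
    return X, y


# Tiny toy frame (illustrative values, not the real dataset)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "M", "B"],
    "radius_mean": [17.99, 20.57, 12.0],
    "Unnamed: 32": [float("nan")] * 3,
})

X_toy, y_toy = prepare_features_and_target_sketch(
    map_diagnosis_to_numerical_sketch(drop_unnecessary_columns_sketch(toy))
)
print(X_toy.columns.tolist(), y_toy.tolist())
```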


## 2. Initial feature selection (Noise reduction)

The EDA in `1.0-EDA.ipynb` identified several features whose absolute correlation with the target variable (`diagnosis`) is very low (below 0.10). Such features carry little linear signal about the target and are more likely to act as noise. Removing them simplifies the model and may improve its performance, although a low linear correlation does not rule out a nonlinear relationship.

**This is distinct** from *feature engineering* or *dimensionality reduction* (like PCA), where the goal is to create new, more informative features. Here, we will simply remove what appears to be irrelevant information.
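
To make the selection rule concrete, here is a minimal sketch of a correlation filter on a small synthetic frame. The column names `informative` and `unrelated` and the data are made up for illustration; only the 0.10 threshold comes from this notebook, and the actual list below was produced in the EDA notebook, not by this snippet.

```python
import numpy as np
import pandas as pd

# Synthetic binary target: alternating 0/1 labels
y_demo = pd.Series(np.tile([0, 1], 100), name="diagnosis")

X_demo = pd.DataFrame({
    # Tracks the target closely, with a small deterministic jitter
    "informative": y_demo * 2.0 + np.tile([0.1, -0.1, -0.1, 0.1], 50),
    # Pattern constructed to be uncorrelated with the alternating target
    "unrelated": np.tile([1.0, 1.0, -1.0, -1.0], 50),
})

threshold = 0.10

# Absolute Pearson correlation of every feature with the target
abs_corr = X_demo.corrwith(y_demo).abs()

# Features below the threshold are candidates for removal
low_corr_features = abs_corr[abs_corr < threshold].index.tolist()

print(abs_corr.round(3))
print(low_corr_features)
```

Taking the absolute value matters: a strongly negative correlation is just as informative as a strongly positive one and should not be filtered out.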

In [5]:
# List of features to drop, identified from EDA (|correlation| < 0.10 with target)
features_to_drop = [
    'fractal_dimension_se',
    'smoothness_se',
    'fractal_dimension_mean',
    'texture_se',
    'symmetry_se'
]

# Drop the features
X_selected = X.drop(columns=features_to_drop)

print(f"Dropped {len(features_to_drop)} features.")
print("Shape of features (X) after selection:", X_selected.shape)
X_selected.head()

Dropped 5 features.
Shape of features (X) after selection: (569, 25)


Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,radius_se,perimeter_se,area_se,compactness_se,concavity_se,concave points_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,1.095,8.589,153.4,0.04904,0.05373,0.01587,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.5435,3.398,74.08,0.01308,0.0186,0.0134,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.7456,4.585,94.03,0.04006,0.03832,0.02058,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.4956,3.445,27.23,0.07458,0.05661,0.01867,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.7572,5.438,94.44,0.02461,0.05688,0.01885,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


## 3. Feature Scaling

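The features in our dataset have vastly different scales: in the rows above, `area_mean` ranges from a few hundred to over a thousand, while `smoothness_mean` stays near 0.1. Algorithms that rely on distances or variances (such as SVMs or PCA) or on gradient-based optimization (such as logistic regression) are sensitive to this, because large-valued features dominate the computation.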

We will use `StandardScaler` from scikit-learn to transform the data. It standardizes features by removing the mean and scaling to unit variance. The formula for standardization is:

$$ z = \frac{x - \mu}{\sigma} $$

Where:
- $ z $ is the scaled value.
- $ x $ is the original value.
- $ \mu $ (mu) is the mean of the feature column.
- $ \sigma $ (sigma) is the standard deviation of the feature column (`StandardScaler` uses the population standard deviation, i.e. `ddof=0`).
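
As a quick check of the formula, the sketch below standardizes the first five `radius_mean` values by hand and compares the result with `StandardScaler`. The variable names are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five radius_mean values from the table above, as a single-column matrix
x = np.array([[17.99], [20.57], [19.69], [11.42], [20.29]])

mu = x.mean(axis=0)     # column mean
sigma = x.std(axis=0)   # population standard deviation (NumPy default ddof=0)

# Manual standardization: z = (x - mu) / sigma
z_manual = (x - mu) / sigma

# The same transformation through scikit-learn
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
```

The scaled values here differ from those in the full dataset, because the mean and standard deviation are computed from these five rows only.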

In [6]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler to the data and transform it
X_scaled_array = scaler.fit_transform(X_selected)

# Convert the scaled array back to a DataFrame, keeping the column names and
# the row index so that X_processed stays aligned with y
X_processed = pd.DataFrame(X_scaled_array, columns=X_selected.columns, index=X_selected.index)

print("Data successfully scaled.")
X_processed.head()

Data successfully scaled.


Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,radius_se,perimeter_se,area_se,compactness_se,concavity_se,concave points_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.489734,2.833031,2.487578,1.316862,0.724026,0.66082,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,0.499255,0.263327,0.742402,-0.692926,-0.44078,0.260162,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,1.228676,0.850928,1.181336,0.814974,0.213076,1.424827,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,0.326373,0.286593,-0.288378,2.74428,0.819518,1.115007,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,1.270543,1.273189,1.190357,-0.04852,0.828471,1.144205,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971
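
One caveat when the scaled data is later used for model evaluation: fitting the scaler on all rows lets statistics from future test rows influence the transformation. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training portion only. The split ratio, seed, and column names are assumptions for illustration and are not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (illustrative only)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(
    rng.normal(loc=[10.0, 500.0], scale=[2.0, 100.0], size=(200, 2)),
    columns=["radius_like", "area_like"],
)
y_demo = pd.Series(np.tile([0, 1], 100), name="diagnosis")

# Split before scaling so the test rows never influence mu and sigma
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit on the training rows only, then apply the same transform to both parts
scaler_demo = StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(
    scaler_demo.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler_demo.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this approach the training columns have mean 0 and unit variance, while the test columns are only approximately standardized, which is expected.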


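Let's check the summary statistics of the processed data to confirm that it has been standardized. Each feature should show a mean close to 0 and a standard deviation close to 1. The reported `std` is 1.00088 rather than exactly 1 because `describe()` uses the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`); with 569 rows the ratio is $\sqrt{569/568} \approx 1.00088$.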

In [7]:
X_processed.describe().T

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| radius_mean | 569.0 | -1.373633e-16 | 1.00088 | -2.029648 | -0.689385 | -0.215082 | 0.469393 | 3.971288 |
| texture_mean | 569.0 | 6.868164e-17 | 1.00088 | -2.229249 | -0.725963 | -0.104636 | 0.584176 | 4.651889 |
| perimeter_mean | 569.0 | -1.248757e-16 | 1.00088 | -1.984504 | -0.691956 | -0.23598 | 0.499677 | 3.97613 |
| area_mean | 569.0 | -2.185325e-16 | 1.00088 | -1.454443 | -0.667195 | -0.295187 | 0.363507 | 5.250529 |
| smoothness_mean | 569.0 | -8.366672e-16 | 1.00088 | -3.112085 | -0.710963 | -0.034891 | 0.636199 | 4.770911 |
| compactness_mean | 569.0 | 1.873136e-16 | 1.00088 | -1.610136 | -0.747086 | -0.22194 | 0.493857 | 4.568425 |
| concavity_mean | 569.0 | 4.995028e-17 | 1.00088 | -1.114873 | -0.743748 | -0.34224 | 0.526062 | 4.243589 |
| concave points_mean | 569.0 | -4.995028e-17 | 1.00088 | -1.26182 | -0.737944 | -0.397721 | 0.646935 | 3.92793 |
| symmetry_mean | 569.0 | 1.74826e-16 | 1.00088 | -2.744117 | -0.70324 | -0.071627 | 0.530779 | 4.484751 |
| radius_se | 569.0 | 2.372638e-16 | 1.00088 | -1.059924 | -0.623571 | -0.292245 | 0.2661 | 8.906909 |
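
The `std` column can be reproduced with a short sketch. It standardizes an arbitrary column of 569 values and compares the two standard-deviation conventions; the input values are synthetic and chosen only to match the row count of this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 569  # number of rows in this dataset

# Any non-constant column works; the values themselves are arbitrary
values = np.arange(n, dtype=float).reshape(-1, 1)
scaled = pd.Series(StandardScaler().fit_transform(values).ravel())

std_population = scaled.std(ddof=0)  # convention used by StandardScaler
std_sample = scaled.std(ddof=1)      # pandas default, used by describe()

print(std_population, std_sample, np.sqrt(n / (n - 1)))
```

The sample standard deviation equals $\sqrt{n/(n-1)}$ for any column standardized this way, which is why every row of the table shows the same `std`.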


## 4. Next steps

The data preprocessing step is complete. We have created a fully preprocessed feature set (`X_processed`) and have our target variable (`y`).

**Boundary:**
- The output of this notebook is the clean and scaled data.
- This data is the direct input for the `3.0-Feature_engineering.ipynb` notebook, where we'll explore techniques like Principal Component Analysis (PCA) to create new, engineered features from this processed dataset. We will also investigate other feature creation methods.
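
This notebook does not show how `X_processed` and `y` are handed over to the next notebook. One possible approach, sketched below, is to write both to CSV files and read them back. The directory name, file names, and the tiny stand-in frames are hypothetical, and a temporary directory is used so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Small stand-ins for X_processed and y (illustrative values only)
X_out = pd.DataFrame({
    "radius_mean": [1.097064, 1.829821],
    "texture_mean": [-2.073335, -0.353632],
})
y_out = pd.Series([1, 1], name="diagnosis")  # assumes malignant is encoded as 1

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "processed"  # hypothetical output folder
    out_dir.mkdir(parents=True)

    # Persist features and target side by side
    X_out.to_csv(out_dir / "X_processed.csv", index=False)
    y_out.to_csv(out_dir / "y.csv", index=False)

    # The next notebook would load them back like this
    X_back = pd.read_csv(out_dir / "X_processed.csv")
    y_back = pd.read_csv(out_dir / "y.csv")["diagnosis"]

print(X_back.shape, y_back.tolist())
```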