
In [19]:
#All needed imports
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline




In [2]:
# Load dataset
data = load_breast_cancer()

# Create a DataFrame with all features
X = pd.DataFrame(data.data, columns=data.feature_names)

# Target variable
y = pd.Series(data.target)

In [3]:
# Dataset shape
print("Dataset shape:", X.shape)

Dataset shape: (569, 30)


In [4]:
# Class distribution
print("Target class distribution:")
print(y.value_counts())

Target class distribution:
1    357
0    212
Name: count, dtype: int64


In [5]:
# Target label names
print("\nTarget names:", data.target_names)


Target names: ['malignant' 'benign']


The dataset consists of 569 samples, each described by 30 numerical features.
The target variable is binary: class 0 corresponds to malignant tumors and class 1 to benign tumors.
The classes are moderately imbalanced toward class 1 (357 benign vs. 212 malignant samples), so the train/test split below is stratified on the target.
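
As a quick check of the imbalance, the class shares can be computed directly from the counts printed above. This is a minimal sketch that hard-codes the two counts from the `value_counts()` output instead of reloading the dataset; the dictionary and variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"malignant (0)": 212, "benign (1)": 357}

total = sum(class_counts.values())
class_shares = {name: count / total for name, count in class_counts.items()}

# Print each class share as a percentage with one decimal place.
for name, share in class_shares.items():
    print(f"{name}: {share:.1%}")
```

Benign cases make up a little under two thirds of the data, so the imbalance is noticeable but not extreme.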

In [6]:
# Missing values
X.isna().sum()

mean radius               0
mean texture              0
mean perimeter            0
mean area                 0
mean smoothness           0
mean compactness          0
mean concavity            0
mean concave points       0
mean symmetry             0
mean fractal dimension    0


In [7]:
# Duplicate rows
X.duplicated().sum()

np.int64(0)

In [8]:
# Top 5 features with highest standard deviation

feature_std = X.std()
feature_std_sorted = feature_std.sort_values(ascending=False)
top5_std = feature_std_sorted.head(5)
top5_std

worst area         569.356993
mean area          351.914129
area error          45.491006
worst perimeter     33.602542
mean perimeter      24.298981


In [11]:
#Stratified train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)

In [15]:
#Decision Tree without scaling
dt_no_scaling = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = dt_no_scaling.predict(X_test)

In [20]:
# Decision Tree with RobustScaler - using a pipeline to ensure proper scaling
dt_with_scaling = Pipeline([
    ("scaler", RobustScaler()),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=42)),
])

dt_with_scaling.fit(X_train, y_train)

y_pred_with_scaling = dt_with_scaling.predict(X_test)

In [22]:
# Compare the test-set predictions of the two models
check_identical = (y_pred_no_scaling == y_pred_with_scaling).all()
check_identical

np.True_

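The two models produce identical predictions on the test set. This is expected: a decision tree splits on per-feature thresholds rather than on distances or inner products. RobustScaler applies a strictly increasing transformation to each feature (subtract the median, divide by the interquartile range), which preserves the ordering of the samples along every feature. Apart from possible floating-point edge cases, the tree therefore finds the same partitions of the training data, only with rescaled threshold values, and its predictions do not change.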
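
The ordering argument can be illustrated with a small sketch that does not depend on the notebook's state. It is a toy under stated assumptions: a single-feature decision stump that picks the threshold minimizing weighted Gini impurity (not scikit-learn's actual implementation), synthetic data, and a hand-written robust scaling whose quartile convention may differ from that of `RobustScaler`. The helper names `gini` and `best_split_left_indices` are hypothetical.

```python
import random
from collections import Counter
from statistics import median, quantiles


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_left_indices(xs, ys):
    """Indices sent to the left child by the best single-threshold split."""
    distinct = sorted(set(xs))
    best_score, best_left = float("inf"), None
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [i for i, x in enumerate(xs) if x <= threshold]
        right = [i for i, x in enumerate(xs) if x > threshold]
        # Weighted Gini impurity of the two child nodes.
        score = (
            len(left) * gini([ys[i] for i in left])
            + len(right) * gini([ys[i] for i in right])
        ) / len(ys)
        if score < best_score:
            best_score, best_left = score, left
    return best_left


# Synthetic feature on a large scale, with a noisy binary label (assumed data).
rng = random.Random(0)
xs = [rng.gauss(500.0, 300.0) for _ in range(200)]
ys = [int(x + rng.gauss(0.0, 100.0) > 500.0) for x in xs]

# Robust scaling by hand: subtract the median, divide by the interquartile range.
q1, _, q3 = quantiles(xs, n=4)
center = median(xs)
xs_scaled = [(x - center) / (q3 - q1) for x in xs]

same_partition = (
    best_split_left_indices(xs, ys) == best_split_left_indices(xs_scaled, ys)
)
print("Same partition with and without scaling:", same_partition)
```

The impurity of a split depends only on which samples fall on each side, and scaling by a positive factor keeps their order, so both runs should select the same partition even though the threshold values differ.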