<a href="https://colab.research.google.com/github/giannismantzaris-cmd/DAMA61/blob/main/Mantzaris_WA3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#All needed imports
import numpy as np
import pandas as pd
import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding





In [2]:
# Load dataset
data = load_breast_cancer()

# Create a DataFrame with all features
X = pd.DataFrame(data.data, columns=data.feature_names)

# Target variable
y = pd.Series(data.target)

In [None]:
# Dataset shape
print("Dataset shape:", X.shape)

Dataset shape: (569, 30)


In [3]:
# Class distribution
print("Target class distribution:")
print(y.value_counts())

Target class distribution:
1    357
0    212
Name: count, dtype: int64


In [4]:
# Target label names
print("\nTarget names:", data.target_names)


Target names: ['malignant' 'benign']


The dataset consists of 569 samples, each described by 30 numerical features.
The target variable is binary, with class 0 corresponding to malignant tumors and class 1 to benign tumors.
The classes are moderately imbalanced in favor of class 1: 357 benign samples (about 63%) versus 212 malignant samples (about 37%).
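
The class shares can be checked directly with `value_counts(normalize=True)`. This is a minimal sketch that reloads the dataset so it runs on its own; the names `data_check`, `y_check` and `class_share` are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data_check = load_breast_cancer()
y_check = pd.Series(data_check.target)

# Relative class frequencies, indexed by the integer label (0 = malignant, 1 = benign)
class_share = y_check.value_counts(normalize=True)
for label, share in class_share.items():
    print(f"{data_check.target_names[label]}: {share:.1%}")
```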

In [5]:
# Missing values
X.isna().sum()

feature,missing values
mean radius,0
mean texture,0
mean perimeter,0
mean area,0
mean smoothness,0
mean compactness,0
mean concavity,0
mean concave points,0
mean symmetry,0
mean fractal dimension,0


In [6]:
# Duplicate rows
X.duplicated().sum()

np.int64(0)

In [7]:
# Top 5 features with highest standard deviation

feature_std = X.std()
feature_std_sorted = feature_std.sort_values(ascending=False)
top5_std = feature_std_sorted.head(5)
top5_std

feature,std
worst area,569.356993
mean area,351.914129
area error,45.491006
worst perimeter,33.602542
mean perimeter,24.298981


In [8]:
#Stratified train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=42)

In [9]:
#Decision Tree without scaling
dt_no_scaling = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_no_scaling.fit(X_train, y_train)

y_pred_no_scaling = dt_no_scaling.predict(X_test)

In [10]:
# Decision Tree with RobustScaler - using a pipeline to ensure proper scaling
dt_with_scaling = Pipeline([("scaler", RobustScaler()),("tree", DecisionTreeClassifier(max_depth=3, random_state=42))])

dt_with_scaling.fit(X_train, y_train)

y_pred_with_scaling = dt_with_scaling.predict(X_test)

In [11]:
# Compare the test-set predictions of the two models
check_identical = (y_pred_no_scaling == y_pred_with_scaling).all()
check_identical

np.True_

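The two models produce identical predictions. This is expected: decision trees split on per-feature thresholds rather than on distances or inner products, and RobustScaler applies an order-preserving affine transformation to each feature. The same partitions of the training samples are therefore available before and after scaling, so the predictions do not change.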
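
One way to see why the predictions match is to compare the two fitted trees node by node. The sketch below refits both models with the same split and settings as above, then maps each threshold learned on the scaled data back to the original units using the scaler's `center_` and `scale_` attributes. It assumes that both trees pick the same split feature at every node; that assumption is checked explicitly rather than taken for granted. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

# Tree on the raw features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

# Tree on robust-scaled features (scaler fitted on the training set only)
scaler = RobustScaler().fit(X_tr)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=42).fit(
    scaler.transform(X_tr), y_tr
)

same_predictions = np.array_equal(
    tree_raw.predict(X_te), tree_scaled.predict(scaler.transform(X_te))
)
print("Identical test predictions:", same_predictions)

# Compare the split features; leaf nodes carry a negative feature index
feat_raw, feat_scaled = tree_raw.tree_.feature, tree_scaled.tree_.feature
same_structure = np.array_equal(feat_raw, feat_scaled)
print("Same split features:", same_structure)

if same_structure:
    internal = feat_raw >= 0
    f = feat_raw[internal]
    # Undo the scaling: x_original = x_scaled * scale_ + center_
    thr_back = tree_scaled.tree_.threshold[internal] * scaler.scale_[f] + scaler.center_[f]
    diff = np.abs(thr_back - tree_raw.tree_.threshold[internal]).max()
    print("Max threshold difference in original units:", diff)
```

If the structures match, the back-transformed thresholds should agree with the unscaled ones up to floating-point error, which is the concrete sense in which scaling leaves the tree unchanged.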

In [12]:
# Export the tree structure in text form
tree_structure = export_text(dt_no_scaling, feature_names=list(X.columns))
print(tree_structure)

|--- worst radius <= 16.80
|   |--- worst concave points <= 0.14
|   |   |--- area error <= 91.56
|   |   |   |--- class: 1
|   |   |--- area error >  91.56
|   |   |   |--- class: 0
|   |--- worst concave points >  0.14
|   |   |--- worst texture <= 25.62
|   |   |   |--- class: 1
|   |   |--- worst texture >  25.62
|   |   |   |--- class: 0
|--- worst radius >  16.80
|   |--- texture error <= 0.47
|   |   |--- class: 1
|   |--- texture error >  0.47
|   |   |--- worst concavity <= 0.19
|   |   |   |--- class: 1
|   |   |--- worst concavity >  0.19
|   |   |   |--- class: 0

In [13]:
# Total number of leaves
num_leaves = dt_no_scaling.get_n_leaves()
num_leaves

np.int64(7)

The root node's splitting feature is worst radius, and the tree has 7 leaves.
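
The root feature and the leaf count can also be read from the fitted estimator instead of the text export. This is a small sketch that refits the same tree so it is self-contained; `tree_.feature[0]` is the index of the feature used at node 0 (the root), and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

bc = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    bc.data, bc.target, test_size=0.20, stratify=bc.target, random_state=42
)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

# Node 0 is the root; map its feature index back to a column name
root_index = tree.tree_.feature[0]
root_feature = bc.feature_names[root_index]
n_leaves = tree.get_n_leaves()

print("Root split feature:", root_feature)
print("Root threshold:", round(float(tree.tree_.threshold[0]), 2))
print("Number of leaves:", n_leaves)
```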

In [14]:
#Simple RF (no preprocessing)
rf = RandomForestClassifier(random_state=42)

param_grid_rf = {"n_estimators": [50, 100],"max_depth": [3, 5, 10]}

start_time = time.time()

grid_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring="roc_auc")

grid_rf.fit(X_train, y_train)

train_time_rf = time.time() - start_time

In [15]:
#Evaluate simple RF in test set
best_rf = grid_rf.best_estimator_

y_test_prob_rf = best_rf.predict_proba(X_test)[:, 1]
y_test_pred_rf = best_rf.predict(X_test)

roc_auc_test_rf = roc_auc_score(y_test, y_test_prob_rf)
acc_test_rf = accuracy_score(y_test, y_test_pred_rf)

In [17]:
print("Simple RF")

print("Mean ROC-AUC (training set)", grid_rf.best_score_)
print("ROC-AUC (test set)", roc_auc_test_rf)
print("Accuracy (test set)", acc_test_rf)
print("Training time (seconds)", train_time_rf)
print("Best hyperparameters", grid_rf.best_params_)

Simple RF
Mean ROC-AUC (training set) 0.9888544891640867
ROC-AUC (test set) 0.9933862433862434
Accuracy (test set) 0.956140350877193
Training time (seconds) 8.5508291721344
Best hyperparameters {'max_depth': 5, 'n_estimators': 100}


In [18]:
#RF with PCA & standard scaler
pipe_pca = Pipeline([("scaler", StandardScaler()), ("pca", PCA()), ("rf", RandomForestClassifier(random_state=42))])

param_grid_pca = {"pca__n_components": [10, 20, 30], "rf__n_estimators": [50, 100], "rf__max_depth": [3, 5, 10]}

start_time = time.time()

grid_pca = GridSearchCV(pipe_pca, param_grid_pca, cv=5, scoring="roc_auc")

grid_pca.fit(X_train, y_train)

train_time_pca = time.time() - start_time


In [19]:
#Evaluate RF with PCA & standard scaler in test set
best_pca = grid_pca.best_estimator_

y_test_prob_pca = best_pca.predict_proba(X_test)[:, 1]
y_test_pred_pca = best_pca.predict(X_test)

roc_auc_test_pca = roc_auc_score(y_test, y_test_prob_pca)
acc_test_pca = accuracy_score(y_test, y_test_pred_pca)

In [20]:
print("RF with PCA & standard scaler")

print("Mean ROC-AUC (training set)", grid_pca.best_score_)
print("ROC-AUC (test set)", roc_auc_test_pca)
print("Accuracy (test set)", acc_test_pca)
print("Training time (seconds)", train_time_pca)
print("Best hyperparameters", grid_pca.best_params_)

RF with PCA & standard scaler
Mean ROC-AUC (training set) 0.9889576883384933
ROC-AUC (test set) 0.9847883597883598
Accuracy (test set) 0.9210526315789473
Training time (seconds) 18.33741569519043
Best hyperparameters {'pca__n_components': 10, 'rf__max_depth': 10, 'rf__n_estimators': 100}


In [23]:
#RF with LLE (Locally Linear Embedding)
pipe_lle = Pipeline([("scaler", StandardScaler()), ("lle", LocallyLinearEmbedding()), ("rf", RandomForestClassifier(random_state=42))])

param_grid_lle = {"lle__n_components": [10, 15], "lle__n_neighbors": [5, 10, 15], "rf__n_estimators": [50, 100], "rf__max_depth": [3, 5, 10]}

start_time = time.time()

grid_lle = GridSearchCV(pipe_lle, param_grid_lle, cv=5, scoring="roc_auc")

grid_lle.fit(X_train, y_train)

train_time_lle = time.time() - start_time

In [24]:
#Evaluate RF with LLE & standard scaler in test set

best_lle = grid_lle.best_estimator_

y_test_prob_lle = best_lle.predict_proba(X_test)[:, 1]
y_test_pred_lle = best_lle.predict(X_test)

roc_auc_test_lle = roc_auc_score(y_test, y_test_prob_lle)
acc_test_lle = accuracy_score(y_test, y_test_pred_lle)

In [25]:
print("RF with LLE & standard scaler")

print("Mean ROC-AUC (training set)", grid_lle.best_score_)
print("ROC-AUC (test set)", roc_auc_test_lle)
print("Accuracy (test set)", acc_test_lle)
print("Training time (seconds)", train_time_lle)
print("Best hyperparameters", grid_lle.best_params_)

RF with LLE & standard scaler
Mean ROC-AUC (training set) 0.9910216718266254
ROC-AUC (test set) 0.9775132275132276
Accuracy (test set) 0.9298245614035088
Training time (seconds) 56.109917402267456
Best hyperparameters {'lle__n_components': 15, 'lle__n_neighbors': 5, 'rf__max_depth': 5, 'rf__n_estimators': 100}


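Among the three Random Forest variants, the simple Random Forest without preprocessing performed best overall: it achieved the highest test-set ROC-AUC (0.993) and accuracy (0.956), and its grid search was also the fastest (about 8.6 s).
The PCA-based model did not improve test performance (ROC-AUC 0.985, accuracy 0.921) and roughly doubled the run time (about 18.3 s). The LLE-based model was by far the most expensive (about 56.1 s) and also performed slightly worse than the simple RF (ROC-AUC 0.978, accuracy 0.930). The differences are small, though: the mean cross-validated ROC-AUC is about 0.99 for all three models, and the accuracy gaps correspond to 3 to 4 of the 114 test samples. The reported times cover the full grid search, and the three grids differ in size, so they are not per-model training times.
This behavior is plausible, as Random Forests are well suited to tabular data of this dimensionality and typically gain little from dimensionality reduction.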
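
Because each timing covers a whole grid search, it helps to put the three runs on a common footing. The sketch below counts the candidate settings in each grid with `ParameterGrid` and divides the reported wall-clock time by the number of model fits (candidates times 5 folds, plus one refit of the best model, which `GridSearchCV` does by default). The resulting per-fit figure is only a rough average: it also includes scoring time, and individual fits differ in cost.

```python
from sklearn.model_selection import ParameterGrid

# Parameter grids copied from the three searches above
grids = {
    "Simple RF": {"n_estimators": [50, 100], "max_depth": [3, 5, 10]},
    "RF + PCA": {
        "pca__n_components": [10, 20, 30],
        "rf__n_estimators": [50, 100],
        "rf__max_depth": [3, 5, 10],
    },
    "RF + LLE": {
        "lle__n_components": [10, 15],
        "lle__n_neighbors": [5, 10, 15],
        "rf__n_estimators": [50, 100],
        "rf__max_depth": [3, 5, 10],
    },
}
# Wall-clock grid-search times reported above, rounded (seconds)
times = {"Simple RF": 8.55, "RF + PCA": 18.34, "RF + LLE": 56.11}
cv_folds = 5

n_fits = {}
for name, grid in grids.items():
    n_candidates = len(ParameterGrid(grid))
    n_fits[name] = n_candidates * cv_folds + 1  # +1 for the final refit
    per_fit = times[name] / n_fits[name]
    print(f"{name}: {n_candidates} candidates, {n_fits[name]} fits, ~{per_fit:.2f} s per fit")
```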

PROBLEM 2

In [4]:
import pandas as pd

col_names = ["Profile_mean", "Profile_std", "Profile_kurtosis", "Profile_skewness", "DM_mean", "DM_std", "DM_kurtosis", "DM_skewness", "class"]

df = pd.read_csv("/content/HTRU_2.csv", header=None, names=col_names)

In [11]:
#shape of the dataset, column names and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Profile_mean      17898 non-null  float64
 1   Profile_std       17898 non-null  float64
 2   Profile_kurtosis  17898 non-null  float64
 3   Profile_skewness  17898 non-null  float64
 4   DM_mean           17898 non-null  float64
 5   DM_std            17898 non-null  float64
 6   DM_kurtosis       17898 non-null  float64
 7   DM_skewness       17898 non-null  float64
 8   class             17898 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.2 MB


In [6]:
#inspect a few initial values
df.head()

,Profile_mean,Profile_std,Profile_kurtosis,Profile_skewness,DM_mean,DM_std,DM_kurtosis,DM_skewness,class
0,140.5625,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,0


In [12]:
# compute descriptive statistics
df.describe().T

feature,count,mean,std,min,25%,50%,75%,max
Profile_mean,17898.0,111.079968,25.652935,5.8125,100.929688,115.078125,127.085938,192.617188
Profile_std,17898.0,46.549532,6.843189,24.772042,42.376018,46.947479,51.023202,98.778911
Profile_kurtosis,17898.0,0.477857,1.06404,-1.876011,0.027098,0.22324,0.473325,8.069522
Profile_skewness,17898.0,1.770279,6.167913,-1.791886,-0.188572,0.19871,0.927783,68.101622
DM_mean,17898.0,12.6144,29.472897,0.213211,1.923077,2.801839,5.464256,223.392141
DM_std,17898.0,26.326515,19.470572,7.370432,14.437332,18.461316,28.428104,110.642211
DM_kurtosis,17898.0,8.303556,4.506092,-3.13927,5.781506,8.433515,10.702959,34.539844
DM_skewness,17898.0,104.857709,106.51454,-1.976976,34.960504,83.064556,139.30933,1191.000837
class,17898.0,0.091574,0.288432,0.0,0.0,0.0,0.0,1.0


In [14]:
#check number of missing values per column
df.isna().sum()

column,missing values
Profile_mean,0
Profile_std,0
Profile_kurtosis,0
Profile_skewness,0
DM_mean,0
DM_std,0
DM_kurtosis,0
DM_skewness,0
class,0


In [15]:
#check for imbalance
df["class"].value_counts()

class,count
0,16259
1,1639


The descriptive statistics show that the features have very different scales and distributions, with some variables exhibiting large ranges and extreme values. No missing data are present, so no imputation is required.

The necessity of preprocessing depends on the method used.

For Random Forests, no feature scaling or transformation is required: tree-based splits do not rely on distance calculations, so the model is insensitive to differences in feature scale and largely robust to skewed distributions and outliers.

In contrast, k-means clustering and t-SNE are distance-based methods that rely on Euclidean distances. They are therefore highly sensitive to differences in feature scale, such as the ones present in this dataset. Feature standardization is necessary before applying them, so that no single feature dominates the distance calculations.
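
The effect of scale on a distance-based method can be illustrated on a toy example. The sketch below does not use the pulsar data: it builds a synthetic two-feature dataset (an assumption made purely for illustration) in which a small-scale feature separates two groups and a large-scale feature is pure noise. It then compares k-means with and without `StandardScaler`, using the adjusted Rand index (ARI) against the known groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
labels_true = np.repeat([0, 1], n)

# Informative feature: two well-separated groups on a small scale
informative = np.concatenate([rng.normal(0, 1, n), rng.normal(10, 1, n)])
# Uninformative feature: pure noise on a much larger scale
noise = rng.normal(0, 100, 2 * n)
X_toy = np.column_stack([informative, noise])


def cluster_ari(X):
    """Run 2-means and score the clustering against the true groups."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    return adjusted_rand_score(labels_true, km.fit_predict(X))


ari_raw = cluster_ari(X_toy)
ari_scaled = cluster_ari(StandardScaler().fit_transform(X_toy))
print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

Without scaling, the Euclidean distances are dominated by the noisy large-scale feature, so the clusters follow the noise; after standardization the informative feature drives the split.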

Finally, the dataset is highly imbalanced: only 1,639 of the 17,898 samples (about 9.2%) belong to class 1. This does not call for any feature transformation, since it concerns the target variable rather than the input features, but it should be taken into account in the data splitting and in model evaluation.
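
Two practical consequences of the imbalance can be sketched with labels alone. The example below builds a label vector with the class counts reported above (no features are needed), shows that a stratified split keeps the positive share the same in both parts, and shows why plain accuracy is a weak metric here: a classifier that always predicts class 0 already scores above 90%. The variable names and the choice of balanced accuracy as a contrast metric are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Labels with the class counts reported above (16259 negatives, 1639 positives)
y_toy = np.concatenate([np.zeros(16259, dtype=int), np.ones(1639, dtype=int)])
idx = np.arange(len(y_toy))

# Stratified 80/20 split on the labels
idx_train, idx_test, y_train_toy, y_test_toy = train_test_split(
    idx, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)
print(f"Positive share, full data: {y_toy.mean():.4f}")
print(f"Positive share, train:     {y_train_toy.mean():.4f}")
print(f"Positive share, test:      {y_test_toy.mean():.4f}")

# A trivial classifier that always predicts the majority class
y_majority = np.zeros_like(y_test_toy)
acc_majority = accuracy_score(y_test_toy, y_majority)
bal_acc_majority = balanced_accuracy_score(y_test_toy, y_majority)
print(f"Majority baseline accuracy:          {acc_majority:.3f}")
print(f"Majority baseline balanced accuracy: {bal_acc_majority:.3f}")
```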