## Module 5 Worksheet - Chapter 9

The three checkpoints included in this worksheet need to be completed and marked during your lab session.

### Checkpoint 1 - Cross Validation

Load the California Housing regression dataset (<code>datasets.fetch_california_housing()</code>) and train KNeighborsRegressor, LinearRegression and DecisionTreeRegressor models to predict the median house price of a block group instance.

Compare the results and runtimes when performing the following model evaluation procedures:
- Evaluate using train/test splitting (hold out 10% for testing)
- Evaluate using K-fold cross validation (try for K = 10, 100 and 1000)

In [1]:
# Enter your code for Checkpoint 1 here

import time
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# --- Load and Prepare the Data ---
housing = datasets.fetch_california_housing()
X = housing.data
y = housing.target

# Split the data into training and testing sets (10% for testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Initialize the three models
models = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
}

# Evaluate using Train/Test Splitting 
print("--- Train/Test Split Evaluation ---")
for name, model in models.items():
    start_time = time.time()
    # Train the model
    model.fit(X_train, y_train)
    # Make predictions
    y_pred = model.predict(X_test)
    # Calculate the R-squared score
    r2 = r2_score(y_test, y_pred)
    runtime = time.time() - start_time
    print(f"  {name:<20} | R-squared: {r2:.4f} | Runtime: {runtime:.4f}s")
    

--- Train/Test Split Evaluation ---
  LinearRegression     | R-squared: 0.5808 | Runtime: 0.0313s
  KNeighborsRegressor  | R-squared: 0.1712 | Runtime: 0.0545s
  DecisionTreeRegressor | R-squared: 0.6357 | Runtime: 0.3290s
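
For reference, the R-squared values above follow the usual definition R^2 = 1 - SS_res / SS_tot. The sketch below is a minimal plain-Python illustration with made-up toy numbers (the helper name `r2_by_hand` is hypothetical, not part of scikit-learn). It shows how the score is computed and why it can drop to zero or below when predictions are no better than always predicting the mean:

```python
def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy values, invented for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]        # close to the targets
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean of y_true
bad = [4.0, 3.0, 2.0, 1.0]         # reversed ordering

r2_good = r2_by_hand(y_true, good)
r2_mean = r2_by_hand(y_true, mean_only)
r2_bad = r2_by_hand(y_true, bad)
print(f"good: {r2_good:.2f}, mean-only: {r2_mean:.2f}, bad: {r2_bad:.2f}")
```

A score near 1 means the residuals are small relative to the spread of the targets; a score near 0 means the model is roughly as informative as a constant prediction.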


In [2]:
# Evaluate using K-Fold Cross-Validation
print("\n--- K-Fold Cross-Validation Evaluation ---")

# Define the values for K to test
k_values = [10, 100, 1000]

for k in k_values:
    print(f"\nEvaluating with K = {k}...")
    
    # Create a new KFold object for each 'k'
    cv = KFold(n_splits=k, shuffle=True, random_state=42)

    for name, model in models.items():
        start_time = time.time()
        # Perform cross-validation with 'r2' scoring on the shuffled folds defined above
        scores = cross_val_score(
            model,
            X,
            y,
            cv=cv,  
            scoring="r2",
            n_jobs=-1,  # Use all available CPU cores for speed
        )
        runtime = time.time() - start_time
        
        r2_mean = np.mean(scores)
        r2_std = np.std(scores)
        print(
            f"  {name:<20} | Mean R-squared: {r2_mean:.4f} | Std R-squared: {r2_std:.4f} | Runtime: {runtime:.4f}s"
        )


--- K-Fold Cross-Validation Evaluation ---

Evaluating with K = 10...
  LinearRegression     | Mean R-squared: 0.6001 | Std R-squared: 0.0222 | Runtime: 3.5031s
  KNeighborsRegressor  | Mean R-squared: 0.1684 | Std R-squared: 0.0190 | Runtime: 1.9828s
  DecisionTreeRegressor | Mean R-squared: 0.6039 | Std R-squared: 0.0167 | Runtime: 0.5911s

Evaluating with K = 100...
  LinearRegression     | Mean R-squared: 0.6013 | Std R-squared: 0.0631 | Runtime: 0.3571s
  KNeighborsRegressor  | Mean R-squared: 0.1732 | Std R-squared: 0.0688 | Runtime: 0.6696s
  DecisionTreeRegressor | Mean R-squared: 0.6073 | Std R-squared: 0.0789 | Runtime: 4.0962s

Evaluating with K = 1000...
  LinearRegression     | Mean R-squared: 0.5646 | Std R-squared: 0.2585 | Runtime: 1.3147s
  KNeighborsRegressor  | Mean R-squared: 0.1011 | Std R-squared: 0.2748 | Runtime: 4.5254s
  DecisionTreeRegressor | Mean R-squared: 0.5558 | Std R-squared: 0.2970 | Runtime: 38.8834s
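
The standard deviation rising with K is probably driven largely by the shrinking test folds: each fold score is computed on fewer samples, so individual scores become noisier. The sketch below estimates the fold sizes, assuming the 20,640 rows of the California Housing dataset and KFold's rule of giving the first `n_samples % K` folds one extra sample. `fold_sizes` is a hypothetical helper written for this illustration:

```python
def fold_sizes(n_samples, k):
    """Test-fold sizes when n_samples are split into k folds as evenly as possible."""
    base, extra = divmod(n_samples, k)
    # The first `extra` folds receive one additional sample
    return [base + 1] * extra + [base] * (k - extra)

n_samples = 20640  # assumed row count of the California Housing dataset
for k in (10, 100, 1000):
    sizes = fold_sizes(n_samples, k)
    print(f"K = {k:>4}: each test fold holds {min(sizes)} to {max(sizes)} samples")
```

With K = 1000, every R-squared value is estimated from only about 20 samples, which is consistent with the much larger spread in the output above.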


### Checkpoint 2 - Model Evaluation Metrics

Load the UCI Breast Cancer Wisconsin (Diagnostic) classification dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) and train a RandomForestClassifier model to predict whether the cancer is malignant (0) or benign (1).

Evaluate the performance of the model using the following metrics (use stratified 10-fold cross validation):
- Accuracy
- Precision
- Recall
- F1-Score

---

We can create a naive (dummy) classifier model that always predicts the most common label using the following code:

```python
from sklearn.dummy import DummyClassifier
breast_cancer = datasets.load_breast_cancer()
dc = DummyClassifier(strategy = 'most_frequent')
dc.fit(breast_cancer.data, breast_cancer.target)
```

For this dataset, the DummyClassifier model will always output 1 regardless of what the input feature vector is.

Calculate the Accuracy, Precision, Recall and F1-Score for this DummyClassifier model (use stratified 10-fold cross validation).

In [3]:
# Enter your code for Checkpoint 2 here

import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load the dataset
breast_cancer = datasets.load_breast_cancer()

# Create a Dummy Classifier
# A DummyClassifier is a baseline model used to compare against "real" models.
# Here we use strategy='most_frequent', which always predicts the majority class
# found in the training set. This provides a minimum benchmark for accuracy.
dc = DummyClassifier(strategy='most_frequent')
dc.fit(breast_cancer.data, breast_cancer.target)



# Prepare Features (X) and Labels (y)
# X contains the input features (e.g., cell nucleus characteristics).
# y contains the target labels (0 = malignant, 1 = benign).
X = breast_cancer.data
y = breast_cancer.target

In [4]:
# Set up the classifiers
# Random Forest Classifier
# An ensemble method that builds multiple decision trees and
# combines their predictions for better accuracy and robustness.
# Setting random_state=42 ensures reproducible results.
rf_classifier = RandomForestClassifier(random_state=42)

# Dummy Classifier
# A baseline model that makes predictions without using the input features.
# strategy='most_frequent' always predicts the majority class of the training
# fold, as the worksheet specifies. (For predict(), the default strategy='prior'
# returns the same labels, so the scores are unaffected.)
dc_classifier = DummyClassifier(strategy='most_frequent')



# Set up stratified 10-fold cross-validation
# StratifiedKFold ensures that each fold has approximately the same percentage
# of samples from each class as the original dataset.
# This is important for imbalanced datasets like breast cancer (more benign than malignant).
# - n_splits=10 → split the dataset into 10 folds
# - shuffle=True → shuffle data before splitting (prevents bias from ordering)
# - random_state=42 → ensures the same shuffling every run (reproducibility)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)


In [5]:
# Define the scoring metrics
# Evaluating the model using four common classification metrics:
# - accuracy:   Overall proportion of correct predictions
# - precision:  Of the positive predictions, how many were correct
# - recall:     Of the actual positives, how many were detected
# - f1:         Harmonic mean of precision and recall (balances both)
scoring = ['accuracy', 'precision', 'recall', 'f1']



# Evaluate the RandomForestClassifier
print("--- RandomForestClassifier Evaluation ---")

# Loop through each scoring metric
for metric in scoring:
    # Perform stratified 10-fold cross-validation
    # - rf_classifier: the model being evaluated
    # - X, y: features and target labels
    # - cv=skf: use stratified 10-fold CV (keeps class balance in each fold)
    # - scoring=metric: specify which performance metric to calculate
    # - n_jobs=-1: use all CPU cores for faster computation
    scores = cross_val_score(rf_classifier, X, y, cv=skf, scoring=metric, n_jobs=-1)
    
    # Print mean and standard deviation of the metric across the 10 folds
    # - Mean shows average model performance
    # - Std shows variability (how stable the model is across folds)
    print(f"  {metric.capitalize():<10}: Mean = {np.mean(scores):.4f}, Std = {np.std(scores):.4f}")


--- RandomForestClassifier Evaluation ---
  Accuracy  : Mean = 0.9561, Std = 0.0239
  Precision : Mean = 0.9627, Std = 0.0331
  Recall    : Mean = 0.9692, Std = 0.0232
  F1        : Mean = 0.9654, Std = 0.0183
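
As an alternative to calling `cross_val_score` once per metric, which refits the model for every metric, scikit-learn's `cross_validate` accepts a list of scorer names and computes all of them from one set of fits. The sketch below uses the same dataset. The reduced `n_estimators=25` and the `_demo` variable names are assumptions chosen to keep the example quick and separate, so its numbers will not match the output above:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Smaller forest than the default, purely to keep this illustration fast
rf_demo = RandomForestClassifier(n_estimators=25, random_state=42)
skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring_demo = ["accuracy", "precision", "recall", "f1"]

# One call: each fold is fitted once and scored with every metric
results = cross_validate(rf_demo, X_demo, y_demo, cv=skf_demo, scoring=scoring_demo)

for metric in scoring_demo:
    scores = results[f"test_{metric}"]  # per-fold scores for this metric
    print(f"  {metric.capitalize():<10}: Mean = {np.mean(scores):.4f}, Std = {np.std(scores):.4f}")
```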


In [6]:
# Evaluate the DummyClassifier
print("\n--- DummyClassifier Evaluation ---")

# Iterate through each metric in the 'scoring' list defined above
# ('accuracy', 'precision', 'recall', 'f1')
for metric in scoring:
    # Perform cross-validation to get performance scores for the DummyClassifier
    # 'dc_classifier': The DummyClassifier model instance
    # 'X': The feature data
    # 'y': The target labels
    # 'cv=skf': Use StratifiedKFold cross-validation for balanced folds
    # 'scoring=metric': Use the current metric from the loop for evaluation
    # 'n_jobs=-1': Use all available CPU cores for faster processing
    scores = cross_val_score(dc_classifier, X, y, cv=skf, scoring=metric, n_jobs=-1)
    
    # Print the mean and standard deviation of the scores for the current metric
    # The mean shows the average performance, and the standard deviation shows the variability
    # {metric.capitalize():<10}: Format the metric name with a capital letter and left-align it
    # {np.mean(scores):.4f}: Format the mean score to 4 decimal places
    # {np.std(scores):.4f}: Format the standard deviation to 4 decimal places
    print(f"  {metric.capitalize():<10}: Mean = {np.mean(scores):.4f}, Std = {np.std(scores):.4f}")


--- DummyClassifier Evaluation ---
  Accuracy  : Mean = 0.6274, Std = 0.0070
  Precision : Mean = 0.6274, Std = 0.0070
  Recall    : Mean = 1.0000, Std = 0.0000
  F1        : Mean = 0.7710, Std = 0.0053
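
The DummyClassifier results above can be reproduced approximately by hand. A classifier that always predicts 1 (benign) turns every benign sample into a true positive and every malignant sample into a false positive. The sketch below uses the whole-dataset class counts (212 malignant, 357 benign); the cross-validated means differ slightly because they average per-fold values:

```python
n_malignant, n_benign = 212, 357  # class counts for label 0 and label 1

# Confusion-matrix entries for a model that always predicts 1 (the positive class)
tp = n_benign      # benign samples, all predicted positive
fp = n_malignant   # malignant samples, also predicted positive
fn = 0             # no positive sample is ever missed
tn = 0             # no sample is ever predicted negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy  = {accuracy:.4f}")
print(f"Precision = {precision:.4f}")
print(f"Recall    = {recall:.4f}")
print(f"F1        = {f1:.4f}")
```

Perfect recall alongside mediocre precision is the signature of a model that labels everything as positive, which is why accuracy or recall alone can be misleading on this dataset.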


### Checkpoint 3 - Grid Search

Load the California Housing regression dataset (<code>datasets.fetch_california_housing()</code>) and train a RandomForestRegressor model that predicts the median house price of a block group instance. Use the following Grid Search approach to identify the most effective combination of hyperparameters:

1) Call "np.random.seed(42)" to fix the random seed.
2) Split the data into training and test sets (test_size = 0.2).
3) Standardise the feature values for the training and test sets, based on the training set values.
4) Define a RandomForestRegressor object, and a KFold object (use K = 5 and shuffle the data order).
5) Define the following Dictionary object, that specifies the possible hyperparameter values that will be evaluated: <code>grids = {'n_estimators': [10, 50, 100], 'min_samples_leaf': [2, 10]}</code>.
6) Define a GridSearchCV object, using the previously defined RandomForestRegressor, KFold and Dictionary (grids) objects as arguments, with the scoring metric set to 'r2'. Hint: set the parameter <code>n_jobs=-1</code> to parallelize the evaluations and reduce program runtime.
7) Fit this GridSearchCV object to the training features and labels.
8) Report the hyperparameter combination that produces the highest R^2 score.

---

Repeat the above process on the Iris dataset for a distance-weighted K-neighbours classifier with the following hyperparameter space:
- distance metric = [euclidean, cosine, manhattan, minkowski]
- K = [1, 3, 5, 10, 50]

Start with accuracy as the scoring metric for comparing hyperparameter combinations.
Then try 'f1_macro' (F1 score with handling for multi-class targets) as well. Do you get a different "best" hyperparameter combination for each scoring metric?

---

Repeat the above process on the UCI Breast Cancer Wisconsin (Diagnostic) dataset for a decision tree classifier with the following hyperparameter space:
- max_depth = [1, 2, ..., 49, 50]
- min_samples_split = [2, 3, ..., 31, 32]
- min_samples_leaf = [1, 2, ..., 49, 50]
- max_leaf_nodes = [2, 3, ..., 127, 128]

Note: unless your computer is VERY powerful, the sheer number of possible combinations for the above hyperparameter values makes regular cross-validation Grid Search infeasible to perform. Instead, you should use RandomizedSearchCV to randomly sample from the space of possible hyperparameter combinations. You can select the number of hyperparameter combinations that are sampled by changing the "n_iter" parameter (we recommend starting with n_iter=1000).
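
The size of this search space can be checked with a quick count. The sketch below multiplies the number of candidate values per hyperparameter; the 5-fold setting is an assumption used only to convert combinations into model fits:

```python
from math import prod

# Candidate values per hyperparameter, as listed above (range ends are exclusive)
search_space = {
    "max_depth": range(1, 51),          # 1..50
    "min_samples_split": range(2, 33),  # 2..32
    "min_samples_leaf": range(1, 51),   # 1..50
    "max_leaf_nodes": range(2, 129),    # 2..128
}

n_combinations = prod(len(values) for values in search_space.values())
cv_folds = 5    # assumed number of folds, for illustration
n_iter = 1000   # recommended starting point for the randomized search

grid_fits = n_combinations * cv_folds
random_fits = n_iter * cv_folds

print(f"Grid combinations:             {n_combinations:,}")
print(f"Fits for exhaustive search:    {grid_fits:,}")
print(f"Fits for randomized search:    {random_fits:,}")
```

Under these assumptions, the randomized search covers roughly one combination in ten thousand while needing only a few thousand fits.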

Try using both 'accuracy' and 'f1' (f1 score) as the scoring metrics for comparing hyperparameter combinations. Do you get a different "best" hyperparameter combination for each scoring metric?

In [7]:
# Enter your code for Checkpoint 3 here
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
california_housing = datasets.fetch_california_housing()
X, y = california_housing.data, california_housing.target


In [8]:
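# Split into training and test sets (20% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardise features: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()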
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
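
To make the "based on the training set values" requirement concrete, the sketch below uses plain Python and invented single-feature values. It computes the mean and standard deviation from the training data only and reuses them for the test data. It uses the population standard deviation (ddof=0), which is what StandardScaler computes:

```python
from statistics import fmean, pstdev

# Toy single-feature values, invented for illustration
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Statistics come from the training data only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation (ddof=0)

# The same mu and sigma are applied to both sets
train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]

print("train mean/std after scaling:", fmean(train_scaled), pstdev(train_scaled))
print("test values after scaling:   ", test_scaled)
```

The scaled test values generally do not have zero mean or unit variance, and that is expected: recomputing the statistics on the test set would leak information about it into the preprocessing.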

In [9]:
import numpy as np
from sklearn.datasets import load_breast_cancer

# 1. Load the dataset
breast_cancer = load_breast_cancer()
y_bc = breast_cancer.target  # Class labels (separate name so the California Housing y is not overwritten)

# 2. Count the occurrences of each class
unique_classes, counts = np.unique(y_bc, return_counts=True)

# 3. Find the most frequent class
most_frequent_class_index = np.argmax(counts)
most_frequent_class_label = unique_classes[most_frequent_class_index]
most_frequent_class_name = breast_cancer.target_names[most_frequent_class_label]

# 4. Print the results
print(f"Class labels: {breast_cancer.target_names}")
print(f"Counts for each class: {counts}")
print("-" * 30)
print(f"The most frequent class is '{most_frequent_class_name}' with {counts[most_frequent_class_index]} instances.")

Class labels: ['malignant' 'benign']
Counts for each class: [212 357]
------------------------------
The most frequent class is 'benign' with 357 instances.


In [10]:
# Define the RandomForestRegressor
# RandomForestRegressor is an ensemble model that builds multiple decision trees
# and averages their predictions to improve accuracy and reduce overfitting.
# Setting random_state=42 ensures reproducibility (same results each run).
rfr = RandomForestRegressor(random_state=42)



# Define the KFold cross-validation strategy
# KFold will split the dataset into 5 folds (subsets).
# - n_splits=5: divides data into 5 roughly equal parts (each iteration trains on 4 folds and validates on the remaining 1)
# - shuffle=True: randomly shuffle the data before splitting to reduce bias
# - random_state=42: ensures consistent shuffling every time for reproducibility
kfold = KFold(n_splits=5, shuffle=True, random_state=42)



# Define the hyperparameter grid for tuning
# The grid specifies different combinations of hyperparameters to try
# during model selection (GridSearchCV).
# - n_estimators: number of trees in the forest (10, 50, 100)
# - min_samples_leaf: minimum number of samples required in a leaf node (2, 10)
# Grid search will test all possible combinations of these values.
grids = {
    'n_estimators': [10, 50, 100],
    'min_samples_leaf': [2, 10]
}


In [11]:
# Initialize GridSearchCV
# GridSearchCV will perform an exhaustive search over the hyperparameter grid.
# Parameters:
# - rfr: the RandomForestRegressor model we defined earlier
# - grids: the dictionary of hyperparameter values to try
# - cv=kfold: use 5-fold cross-validation for evaluation
# - scoring='r2': optimize based on the R² score (coefficient of determination)
# - n_jobs=-1: use all available CPU cores for faster computation
gscv = GridSearchCV(rfr, grids, cv=kfold, scoring='r2', n_jobs=-1)



# Fit the GridSearchCV to training data
# Fit the model to the scaled training dataset.
# - X_train_scaled: training features (standardized/scaled)
# - y_train: training target values
# During this step:
#   - GridSearchCV will try every combination of hyperparameters in `grids`
#   - For each combination, it will run cross-validation (5 folds)
#   - It evaluates each model using the R² score
#   - The best-performing set of hyperparameters is stored
gscv.fit(X_train_scaled, y_train)



# Print the best hyperparameters found
# After fitting, GridSearchCV stores the best parameter set in `.best_params_`
# Here we print the combination of n_estimators and min_samples_leaf
# that gave the highest R² score during cross-validation.
print("California Housing: Best hyperparameters for R2 score:", gscv.best_params_)


California Housing: Best hyperparameters for R2 score: {'min_samples_leaf': 2, 'n_estimators': 100}
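
Beyond `best_params_`, a fitted GridSearchCV also exposes `best_score_` (the mean cross-validated score of the best combination). Because `refit=True` by default, it can also score held-out data with the refitted best estimator. The sketch below runs on small synthetic data from `make_regression` with a reduced grid. Both are assumptions chosen for speed, so this is not a rerun of the California Housing search above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic regression data, for illustration only
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {"n_estimators": [10, 30], "min_samples_leaf": [2, 10]},  # reduced demo grid
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="r2",
)
search.fit(X_tr, y_tr)

cv_r2 = search.best_score_          # mean CV R-squared of the best combination
test_r2 = search.score(X_te, y_te)  # R-squared of the refitted best model on held-out data

print("Best parameters:", search.best_params_)
print(f"Best mean CV R-squared:  {cv_r2:.4f}")
print(f"Held-out test R-squared: {test_r2:.4f}")
```

Reporting the held-out score next to the cross-validated one gives a check that the selected hyperparameters generalise beyond the folds used to choose them.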


In [12]:
# K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

# ------------------------------
# Create a pipeline with scaling + KNN
# ------------------------------

# A Pipeline allows us to chain multiple preprocessing and modeling steps
# so they are executed together inside cross-validation (avoiding data leakage).
# Steps in this pipeline:
# - 'scaler': StandardScaler() standardizes the features 
#             (mean=0, variance=1) → important for distance-based models like KNN
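# - 'knn': KNeighborsClassifier() performs classification based on nearest neighbors.
#          Note: the default is weights='uniform'. The distance-weighted variant the
#          worksheet asks for would be KNeighborsClassifier(weights='distance'), which
#          may change the results reported below.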
knn_pipe = Pipeline([
    ('scaler', StandardScaler()),       # Step 1: Standardize input features
    ('knn', KNeighborsClassifier())     # Step 2: Apply KNN classifier
])



# Define hyperparameter grid for KNN
# Using the double underscore (__) notation to access parameters
# of the 'knn' step inside the pipeline.
# - 'knn__metric': Distance metrics to test:
#     * 'euclidean'  → straight-line distance
#     * 'cosine'     → similarity based on angle between vectors
#     * 'manhattan'  → sum of absolute differences (a.k.a. L1 norm)
#     * 'minkowski'  → generalization of Euclidean and Manhattan (controlled by p;
#                      with the default p=2 it is identical to 'euclidean')
# - 'knn__n_neighbors': Number of nearest neighbors to consider for classification.
#   Trying a range from very local (1) to broader (50).
grids_knn = {
    'knn__metric': ['euclidean', 'cosine', 'manhattan', 'minkowski'],
    'knn__n_neighbors': [1, 3, 5, 10, 50]
}



In [13]:
# Set up GridSearchCV for Accuracy
# GridSearchCV will try every combination of hyperparameters from grids_knn
# and evaluate them using 5-fold cross-validation.
# Parameters:
# - knn_pipe: pipeline (scaler + KNN) ensures preprocessing happens within CV
# - grids_knn: hyperparameter grid (metrics + number of neighbors)
# - cv=5: 5-fold cross-validation (train on 80%, validate on 20%, repeat 5 times)
# - scoring='accuracy': optimize for classification accuracy
# - n_jobs=-1: use all available CPU cores to run in parallel for speed
gscv_acc = GridSearchCV(knn_pipe, grids_knn, cv=5, scoring='accuracy', n_jobs=-1)



# Fit the GridSearchCV to the training data
# Train and evaluate KNN models using all parameter combinations
# - X_train_iris: training features from the Iris dataset
# - y_train_iris: training labels (species of Iris flowers)
# During this step, GridSearchCV:
#   - Runs 5-fold cross-validation for each parameter combination
#   - Computes average accuracy for each
#   - Stores the best parameter combination
gscv_acc.fit(X_train_iris, y_train_iris)



# Print the best hyperparameters
# The attribute .best_params_ contains the hyperparameter set that
# achieved the highest mean accuracy score across CV folds.
print("\nIris Dataset: Best hyperparameters for Accuracy:", gscv_acc.best_params_)



Iris Dataset: Best hyperparameters for Accuracy: {'knn__metric': 'manhattan', 'knn__n_neighbors': 10}


In [14]:
# Set up GridSearchCV for F1-macro
# GridSearchCV will test every combination of hyperparameters in grids_knn
# and evaluate them using 5-fold cross-validation.
# Parameters:
# - knn_pipe: pipeline (scaling + KNN model)
# - grids_knn: hyperparameter grid (distance metrics + number of neighbors)
# - cv=5: 5-fold cross-validation (split data into 5 train/validation folds)
# - scoring='f1_macro': optimize for macro-averaged F1 score
#       * Macro F1 = average of F1 scores across all classes (equal weight)
#       * Good for multi-class classification like Iris (3 species)
# - n_jobs=-1: use all CPU cores to speed up computation
gscv_f1 = GridSearchCV(knn_pipe, grids_knn, cv=5, scoring='f1_macro', n_jobs=-1)



# Fit the GridSearchCV to training data
# Train and evaluate KNN models using all parameter combinations
# - X_train_iris: training features from the Iris dataset
# - y_train_iris: training labels (species of Iris flowers)
# GridSearchCV will:
#   - Run 5-fold CV for each parameter combination
#   - Compute the average F1-macro score
#   - Select the parameter set with the highest score
gscv_f1.fit(X_train_iris, y_train_iris)



# Print the best hyperparameters
# .best_params_ stores the hyperparameter set that produced the
# highest mean F1-macro score across cross-validation folds.
print("Iris Dataset: Best hyperparameters for F1-macro:", gscv_f1.best_params_)


Iris Dataset: Best hyperparameters for F1-macro: {'knn__metric': 'manhattan', 'knn__n_neighbors': 10}
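
Both scoring metrics selecting the same combination is plausible here: when the classes are roughly balanced, as in Iris, macro-averaged F1 and accuracy tend to track each other closely. The sketch below computes both by hand on invented three-class labels. `f1_for_class` is a hypothetical helper, not a scikit-learn function:

```python
def f1_for_class(y_true, y_pred, label):
    """One-vs-rest F1 for a single class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy balanced three-class labels, invented for illustration
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 1]

labels = sorted(set(y_true))
per_class_f1 = [f1_for_class(y_true, y_pred, c) for c in labels]
f1_macro = sum(per_class_f1) / len(per_class_f1)  # unweighted mean over classes
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("Per-class F1:", [round(v, 4) for v in per_class_f1])
print(f"Macro F1 = {f1_macro:.4f}, Accuracy = {accuracy:.4f}")
```

The two metrics are more likely to disagree, and to pick different hyperparameters, when one class is much rarer than the others.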


In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint


# Load Breast Cancer Dataset
# The breast cancer dataset is built into scikit-learn.
# - Features: computed from cell nuclei of breast cancer biopsies
# - Target: 0 = malignant, 1 = benign
breast_cancer = datasets.load_breast_cancer()
X_bc, y_bc = breast_cancer.data, breast_cancer.target



# Initialize Decision Tree Classifier
# DecisionTreeClassifier is a simple, interpretable model.
# Setting random_state=42 ensures reproducibility (consistent splits).
dtc = DecisionTreeClassifier(random_state=42)



# Define Hyperparameter Search Space
# Instead of testing every possible value (GridSearch),
# we use RandomizedSearchCV with random distributions.
# Each parameter is sampled from a specified range (using scipy.stats.randint).
grids_dtc = {
    'max_depth': randint(1, 51),          # Depth of the tree (1–50)
    'min_samples_split': randint(2, 33),  # Minimum samples required to split a node (2–32)
    'min_samples_leaf': randint(1, 51),   # Minimum samples in a leaf node (1–50)
    'max_leaf_nodes': randint(2, 129)     # Maximum number of leaf nodes (2–128)
}


In [16]:
# Randomized Search for Accuracy
# RandomizedSearchCV will sample random combinations of hyperparameters
# from the specified distributions (grids_dtc) instead of exhaustively testing all.
# Parameters:
# - dtc: the DecisionTreeClassifier model
# - grids_dtc: dictionary of hyperparameter distributions (randint ranges)
# - n_iter=1000: number of random combinations to try (larger = more thorough, but slower)
# - cv=5: 5-fold cross-validation (for a classifier, an integer cv uses stratified folds)
# - scoring='accuracy': optimize hyperparameters based on accuracy
# - n_jobs=-1: use all CPU cores for faster computation
# - random_state=42: ensures reproducibility (same random samples across runs)
rscv_acc = RandomizedSearchCV(
    dtc, grids_dtc, n_iter=1000, cv=5,
    scoring='accuracy', n_jobs=-1, random_state=42
)



# Fit the RandomizedSearchCV to the dataset
# Fit the model on the entire breast cancer dataset.
# RandomizedSearchCV will:
#   - Randomly sample 1000 hyperparameter combinations
#   - For each combination, run 5-fold cross-validation
#   - Compute accuracy for each fold
#   - Select the hyperparameters with the best mean accuracy
rscv_acc.fit(X_bc, y_bc)



# Print the best hyperparameters found
# The attribute .best_params_ stores the set of hyperparameters
# that achieved the highest average accuracy across cross-validation folds.
print("\nBreast Cancer: Best hyperparameters for Accuracy:", rscv_acc.best_params_)



Breast Cancer: Best hyperparameters for Accuracy: {'max_depth': 38, 'max_leaf_nodes': 110, 'min_samples_leaf': 8, 'min_samples_split': 28}


In [17]:
# Randomized Search for F1 score
# RandomizedSearchCV will optimize hyperparameters for the DecisionTreeClassifier,
# this time using the F1 score as the evaluation metric.
# Parameters:
# - dtc: DecisionTreeClassifier model
# - grids_dtc: dictionary of hyperparameter distributions (randint ranges)
# - n_iter=1000: number of random hyperparameter combinations to sample
# - cv=5: 5-fold cross-validation
# - scoring='f1': use binary F1 score (harmonic mean of precision & recall)
#                 F1 is especially useful when the dataset is imbalanced
# - n_jobs=-1: use all CPU cores for faster computation
# - random_state=42: reproducible results
rscv_f1 = RandomizedSearchCV(
    dtc, grids_dtc, n_iter=1000, cv=5,
    scoring='f1', n_jobs=-1, random_state=42
)


# Fit the RandomizedSearchCV to the dataset
# Fit the model on the full breast cancer dataset.
# RandomizedSearchCV will:
#   - Randomly sample 1000 hyperparameter sets
#   - Train + validate using 5-fold CV
#   - Compute F1 score for each combination
#   - Select the hyperparameters with the highest mean F1 score
rscv_f1.fit(X_bc, y_bc)



# Print the best hyperparameters
# The attribute .best_params_ gives the hyperparameters that achieved
# the best mean F1 score across all folds.
print("Breast Cancer: Best hyperparameters for F1 score:", rscv_f1.best_params_)


Breast Cancer: Best hyperparameters for F1 score: {'max_depth': 38, 'max_leaf_nodes': 110, 'min_samples_leaf': 8, 'min_samples_split': 28}
